Abstract

The mathematical theory of reproducing kernel Hilbert spaces (RKHS) provides powerful tools for minimum variance estimation (MVE) problems. Here, we extend the classical RKHS-based analysis of MVE in several directions. We develop a geometric formulation of five known lower bounds on the estimator variance (Barankin bound, Cramér-Rao bound, constrained Cramér-Rao bound, Bhattacharyya bound, and Hammersley-Chapman-Robbins bound) in terms of orthogonal projections onto a subspace of the RKHS associated with a given MVE problem. We define the property of differentiability of an RKHS and demonstrate its close relation to the subspace associated with the Cramér-Rao bound. We show that, under mild conditions, the Barankin bound (the tightest possible lower bound on the estimator variance) is a lower semi-continuous function of the parameter vector. We also show that the RKHS associated with an MVE problem remains unchanged if the observation is replaced by a sufficient statistic. Finally, for MVE problems conforming to an exponential family of distributions, we derive novel closed-form lower bounds on the estimator variance and show that a reduction of the parameter set leaves the minimum achievable variance unchanged.

Index Terms 

Minimum variance estimation, exponential families, RKHS, Cramér-Rao bound, Barankin bound, Hammersley-Chapman-Robbins bound, Bhattacharyya bound, locally minimum variance unbiased estimator.

I. Introduction 

We consider the problem of estimating the value $\mathbf{g}(\mathbf{x})$ of a known deterministic function $\mathbf{g}(\cdot)$ evaluated at an unknown nonrandom parameter vector $\mathbf{x} \in \mathcal{X}$, where the parameter set $\mathcal{X}$ is known. The estimation of $\mathbf{g}(\mathbf{x})$ is based on an observed vector $\mathbf{y}$, which is modeled as a random vector with an associated probability measure $\mu^{\mathbf{y}}_{\mathbf{x}}$ or, as a special case, an associated probability density function (pdf) $f(\mathbf{y};\mathbf{x})$, both parametrized by $\mathbf{x} \in \mathcal{X}$. More specifically, we study the problem of minimum variance estimation (MVE), where one aims at finding estimators with minimum variance under the constraint of a prescribed bias.

Our treatment of MVE will be based on the mathematical framework and methodology of reproducing kernel Hilbert spaces (RKHS). The RKHS approach to MVE was introduced in the seminal papers [?] and [?]. A specialization to estimation problems involving sparsity constraints was presented in [?]. The RKHS approach to MVE enables a consistent and intuitive geometric treatment of the MVE problem. In particular, the determination of the minimum achievable variance (or Barankin bound) and of the locally minimum variance estimator reduces to the computation of the squared norm and isometric image of a specific vector (representing the prescribed estimator bias) that belongs to the RKHS associated with the estimation problem. Furthermore, a wide class of lower bounds on the minimum achievable variance (and, in turn, on the variance of any estimator) is obtained by performing projections onto subspaces of the RKHS. The RKHS approach has also proven to be a valuable tool for the analysis of estimation problems involving continuous-time random processes [?].

The main contributions of this paper concern an RKHS-theoretic analysis of the performance of MVE, 
with a focus on questions related to lower variance bounds, sufficient statistics, and observations conforming 
to an exponential family of distributions. First, we give a geometric interpretation of some well-known lower 
bounds on the estimator variance. The tightest of these bounds, i.e., the Barankin bound, is proven to be a 
lower semi-continuous function of the parameter vector x under mild conditions. We then analyze the role 
of a sufficient statistic from the RKHS viewpoint. In particular, we prove that the RKHS associated with an 
estimation problem remains unchanged if the observation y is replaced by any sufficient statistic. Furthermore, 
we characterize the RKHS for estimation problems with observations conforming to an exponential family of 
distributions. It is found that this RKHS has a strong structural property, and that it is explicitly related to the 
moment-generating function of the exponential family. Inspired by this relation, we derive novel lower bounds 
on the estimator variance, and we analyze the effect of parameter set reductions. The lower bounds have a 
particularly simple form. 

The remainder of this paper is organized as follows. Basic elements of MVE are reviewed in Section II, and the RKHS approach to MVE is summarized in Section III. In Section IV, we present an RKHS-based geometric interpretation of known variance bounds and demonstrate the lower semi-continuity of the Barankin bound. The effect of replacing the observation by a sufficient statistic is studied in Section V. In Section VI, the RKHS for exponential family-based estimation problems is investigated, novel lower bounds on the estimator variance are derived, and the effect of a parameter set reduction is analyzed. We note that the proofs of most of the new results presented can be found in the doctoral dissertation [8] and will be referenced in each case.

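Notation and basic definitions. We will use the shorthand notations $\mathbb{N} = \{1, 2, 3, \ldots\}$, $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, and $[N] = \{1, 2, \ldots, N\}$. The open ball in $\mathbb{R}^N$ with radius $r > 0$ and centered at $\mathbf{x}_c$ is defined as $\mathcal{B}(\mathbf{x}_c, r) = \{\mathbf{x} \in \mathbb{R}^N \,|\, \|\mathbf{x} - \mathbf{x}_c\|_2 < r\}$. We call $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^N$ an interior point if $\mathcal{B}(\mathbf{x}, r) \subseteq \mathcal{X}$ for some $r > 0$. The set of all interior points of $\mathcal{X}$ is called the interior of $\mathcal{X}$ and denoted $\mathcal{X}^{\mathrm{o}}$. A set $\mathcal{X}$ is called open if $\mathcal{X} = \mathcal{X}^{\mathrm{o}}$.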

Boldface lowercase (uppercase) letters denote vectors (matrices). The superscript $T$ stands for transposition. The $k$th entry of a vector $\mathbf{x}$ and the entry in the $k$th row and $l$th column of a matrix $\mathbf{A}$ are denoted by $(\mathbf{x})_k = x_k$ and $(\mathbf{A})_{k,l} = A_{k,l}$, respectively. The $k$th unit vector is denoted by $\mathbf{e}_k$, and the identity matrix of size $N \times N$ by $\mathbf{I}_N$. The Moore-Penrose pseudoinverse [9] of a rectangular matrix $\mathbf{F} \in \mathbb{R}^{M \times N}$ is denoted by $\mathbf{F}^{\dagger}$.

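A function $f(\cdot) : \mathcal{D} \to \mathbb{R}$, with $\mathcal{D} \subseteq \mathbb{R}^N$, is said to be lower semi-continuous at $\mathbf{x}_0 \in \mathcal{D}$ if for every $\varepsilon > 0$ there is a radius $r > 0$ such that $f(\mathbf{x}) \ge f(\mathbf{x}_0) - \varepsilon$ for all $\mathbf{x} \in \mathcal{B}(\mathbf{x}_0, r)$. (This definition is equivalent to $\liminf_{\mathbf{x} \to \mathbf{x}_0} f(\mathbf{x}) \ge f(\mathbf{x}_0)$, where $\liminf_{\mathbf{x} \to \mathbf{x}_0} f(\mathbf{x}) = \sup_{r > 0} \big\{ \inf_{\mathbf{x} \in \mathcal{D} \cap [\mathcal{B}(\mathbf{x}_0, r) \setminus \{\mathbf{x}_0\}]} f(\mathbf{x}) \big\}$ [?].) The restriction of a function $f(\cdot) : \mathcal{D} \to \mathbb{R}$ to a subdomain $\mathcal{D}' \subseteq \mathcal{D}$ is denoted by $f(\cdot)|_{\mathcal{D}'}$. Given a multi-index $\mathbf{p} = (p_1 \cdots p_N)^T \in \mathbb{Z}_+^N$, we define the partial derivative of order $\mathbf{p}$ of a real-valued function $f(\cdot) : \mathcal{D} \to \mathbb{R}$, with $\mathcal{D} \subseteq \mathbb{R}^N$, as $\frac{\partial^{\mathbf{p}} f(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} = \frac{\partial^{p_1}}{\partial x_1^{p_1}} \cdots \frac{\partial^{p_N}}{\partial x_N^{p_N}} f(\mathbf{x})$ (if it exists) [?]. Similarly, for a function $f(\cdot\,,\cdot) : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ and two multi-indices $\mathbf{p}_1, \mathbf{p}_2 \in \mathbb{Z}_+^N$, we denote by $\frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2} f(\mathbf{x}_1, \mathbf{x}_2)}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}}$ the partial derivative of order $(\mathbf{p}_1, \mathbf{p}_2)$, where $f(\mathbf{x}_1, \mathbf{x}_2)$ is considered as a function of the "super-vector" $(\mathbf{x}_1^T \, \mathbf{x}_2^T)^T$ of length $2N$. Given a vector-valued function $\boldsymbol{\phi}(\cdot) : \mathbb{R}^M \to \mathbb{R}^N$ and $\mathbf{p} \in \mathbb{Z}_+^N$, we denote the product $\prod_{k=1}^{N} (\phi_k(\mathbf{y}))^{p_k}$ by $\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})$.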

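The probability measure of a random vector $\mathbf{y}$ taking on values in $\mathbb{R}^M$ is denoted by $\mu^{\mathbf{y}}$ [?]. We consider probability measures that are defined on the measure space given by all $M$-dimensional Borel sets on $\mathbb{R}^M$ [?, Sec. 10]. The probability measure assigns to a measurable set $\mathcal{A} \subseteq \mathbb{R}^M$ the probability

\[ \mathsf{P}\{\mathbf{y} \in \mathcal{A}\} \triangleq \int_{\mathbb{R}^M} I_{\mathcal{A}}(\mathbf{y}') \, d\mu^{\mathbf{y}}(\mathbf{y}') = \int_{\mathcal{A}} d\mu^{\mathbf{y}}(\mathbf{y}') , \]

where $I_{\mathcal{A}}(\cdot) : \mathbb{R}^M \to \{0, 1\}$ denotes the indicator function of the set $\mathcal{A}$. We will also consider a family of probability measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$ parametrized by a nonrandom parameter vector $\mathbf{x} \in \mathcal{X}$. We assume that there exists a dominating measure $\mu_{\mathcal{E}}$, so that we can define the pdf $f(\mathbf{y};\mathbf{x})$ (again parametrized by $\mathbf{x}$) as the Radon-Nikodym derivative of the measure $\mu^{\mathbf{y}}_{\mathbf{x}}$ with respect to the measure $\mu_{\mathcal{E}}$ [?], [13]-[15]. (In general, we will choose for $\mu_{\mathcal{E}}$ the Lebesgue measure on $\mathbb{R}^M$.) We refer to both the set of measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$ and the set of pdfs $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ as the statistical model. Given a (possibly vector-valued) deterministic function $\mathbf{t}(\mathbf{y})$, the expectation operation is defined by [?]

\[ \mathsf{E}_{\mathbf{x}}\{\mathbf{t}(\mathbf{y})\} \triangleq \int_{\mathbb{R}^M} \mathbf{t}(\mathbf{y}') \, d\mu^{\mathbf{y}}_{\mathbf{x}}(\mathbf{y}') = \int_{\mathbb{R}^M} \mathbf{t}(\mathbf{y}') f(\mathbf{y}';\mathbf{x}) \, d\mathbf{y}' , \tag{1} \]

where the subscript in $\mathsf{E}_{\mathbf{x}}$ indicates the dependence on the parameter vector $\mathbf{x}$ parametrizing $\mu^{\mathbf{y}}_{\mathbf{x}}$ and $f(\mathbf{y};\mathbf{x})$.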

II. MVE Basics

We consider the estimation of a function value $\mathbf{g}(\mathbf{x})$ from an observed vector $\mathbf{y}$, where the deterministic parameter vector $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^N$ is unknown except for the fact that it belongs to a known parameter set $\mathcal{X}$, and the deterministic parameter function $\mathbf{g}(\cdot) : \mathcal{X} \to \mathbb{R}^P$ is known. Furthermore, the random observation $\mathbf{y} \in \mathbb{R}^M$ is distributed according to the parametrized set of pdfs (the statistical model) $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$. It will be convenient to denote this classical (frequentist) estimation problem by the triple $\mathcal{E} \triangleq (\mathcal{X}, f(\mathbf{y};\mathbf{x}), \mathbf{g}(\cdot))$. Note that our setting includes estimation of the parameter vector $\mathbf{x}$ itself, which is obtained when $\mathbf{g}(\mathbf{x}) = \mathbf{x}$.

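The result of estimating $\mathbf{g}(\mathbf{x})$ from $\mathbf{y}$ is an estimate $\hat{\mathbf{g}} \in \mathbb{R}^P$, which is derived from $\mathbf{y}$ via a deterministic estimator $\hat{\mathbf{g}}(\cdot) : \mathbb{R}^M \to \mathbb{R}^P$, i.e., $\hat{\mathbf{g}} = \hat{\mathbf{g}}(\mathbf{y})$. We assume that any estimator is a measurable mapping from $\mathbb{R}^M$ to $\mathbb{R}^P$ [?, Sec. 13]. The general goal in the design of an estimator $\hat{\mathbf{g}}(\cdot)$ is that $\hat{\mathbf{g}}(\mathbf{y})$ be close to the true value $\mathbf{g}(\mathbf{x})$. A convenient performance criterion is the mean squared error (MSE) defined as

\[ \varepsilon \triangleq \mathsf{E}_{\mathbf{x}}\big\{ \|\hat{\mathbf{g}}(\mathbf{y}) - \mathbf{g}(\mathbf{x})\|_2^2 \big\} = \int_{\mathbb{R}^M} \|\hat{\mathbf{g}}(\mathbf{y}) - \mathbf{g}(\mathbf{x})\|_2^2 \, f(\mathbf{y};\mathbf{x}) \, d\mathbf{y} . \]

We will write $\varepsilon(\hat{\mathbf{g}}(\cdot);\mathbf{x})$ to explicitly indicate the dependence of the MSE on the estimator $\hat{\mathbf{g}}(\cdot)$ and the parameter vector $\mathbf{x}$. Unfortunately, for a general estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), \mathbf{g}(\cdot))$, there does not exist an estimator $\hat{\mathbf{g}}(\cdot)$ that minimizes the MSE simultaneously for all parameter vectors $\mathbf{x} \in \mathcal{X}$ [16], [?]. This follows from the fact that minimizing the MSE at a given parameter vector $\mathbf{x}_0$ always yields zero MSE; this is achieved by the constant estimator $\hat{\mathbf{g}}_0(\mathbf{y}) = \mathbf{g}(\mathbf{x}_0)$ (which equals $\mathbf{x}_0$ when $\mathbf{g}(\mathbf{x}) = \mathbf{x}$), which completely ignores the observation $\mathbf{y}$.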

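A popular rationale for the design of good estimators is MVE. This approach is based on the MSE decomposition

\[ \varepsilon(\hat{\mathbf{g}}(\cdot);\mathbf{x}) = \|\mathbf{b}(\hat{\mathbf{g}}(\cdot);\mathbf{x})\|_2^2 + v(\hat{\mathbf{g}}(\cdot);\mathbf{x}) , \tag{2} \]

with the estimator bias $\mathbf{b}(\hat{\mathbf{g}}(\cdot);\mathbf{x}) \triangleq \mathsf{E}_{\mathbf{x}}\{\hat{\mathbf{g}}(\mathbf{y})\} - \mathbf{g}(\mathbf{x})$ and the estimator variance $v(\hat{\mathbf{g}}(\cdot);\mathbf{x}) \triangleq \mathsf{E}_{\mathbf{x}}\big\{ \|\hat{\mathbf{g}}(\mathbf{y}) - \mathsf{E}_{\mathbf{x}}\{\hat{\mathbf{g}}(\mathbf{y})\}\|_2^2 \big\}$. In MVE, one fixes the bias for all parameter vectors, i.e., $\mathbf{b}(\hat{\mathbf{g}}(\cdot);\mathbf{x}) = \mathbf{c}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, with a prescribed bias function $\mathbf{c}(\cdot) : \mathcal{X} \to \mathbb{R}^P$, and considers only estimators with the given bias. Note that fixing the estimator bias is equivalent to fixing the estimator mean, i.e., $\mathsf{E}_{\mathbf{x}}\{\hat{\mathbf{g}}(\mathbf{y})\} = \boldsymbol{\gamma}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, with the prescribed mean function $\boldsymbol{\gamma}(\mathbf{x}) \triangleq \mathbf{c}(\mathbf{x}) + \mathbf{g}(\mathbf{x})$. The important special case of unbiased estimation is obtained for $\mathbf{c}(\mathbf{x}) \equiv \mathbf{0}$ or, equivalently, $\boldsymbol{\gamma}(\mathbf{x}) = \mathbf{g}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. Fixing the bias can be viewed as a kind of regularization of the set of considered estimators [13], [?], because useless estimators like the constant estimator $\hat{\mathbf{g}}_0(\mathbf{y})$ that ignores the observation are excluded. Another justification for fixing the bias is the fact that, if a large number of independent and identically distributed (i.i.d.) realizations $\{\mathbf{y}_i\}_i$ of the vector $\mathbf{y}$ are observed, then, under certain technical conditions, the bias term dominates in the decomposition (2). Thus, in that case, the MSE is small if and only if the bias is small; this means that the estimator has to be effectively unbiased, i.e., $\mathbf{b}(\hat{\mathbf{g}}(\cdot);\mathbf{x}) \approx \mathbf{0}$ for all $\mathbf{x} \in \mathcal{X}$.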
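The decomposition (2) can be checked numerically on a toy model. The following minimal sketch assumes a scalar binomial observation (dominated by the counting measure rather than the Lebesgue measure) and a hypothetical shrinkage estimator; neither choice is taken from the text. It also evaluates a constant estimator that ignores the observation, which has zero MSE at one parameter value only.

```python
from math import comb

def pmf(n, p):
    """Binomial(n, p) probabilities for y = 0, ..., n."""
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

def mse_bias_variance(estimator, n, p):
    """Exact MSE, bias, and variance of estimator(y) for the target g(p) = p."""
    probs = pmf(n, p)
    values = [estimator(y) for y in range(n + 1)]
    mean = sum(w * v for w, v in zip(probs, values))
    mse = sum(w * (v - p) ** 2 for w, v in zip(probs, values))
    variance = sum(w * (v - mean) ** 2 for w, v in zip(probs, values))
    return mse, mean - p, variance

n, p = 10, 0.3

def shrunk(y):
    """Biased estimator of p (hypothetical choice for illustration)."""
    return (y + 1) / (n + 2)

def constant(y):
    """Ignores y entirely; tuned to the single parameter value p = 0.3."""
    return 0.3

mse, bias, variance = mse_bias_variance(shrunk, n, p)
print("MSE:", mse, " bias^2 + variance:", bias**2 + variance)
print("constant estimator MSE at p=0.3:", mse_bias_variance(constant, n, 0.3)[0])
print("constant estimator MSE at p=0.6:", mse_bias_variance(constant, n, 0.6)[0])
```

The constant estimator illustrates why a bias constraint is needed: it is optimal at one parameter value and poor elsewhere.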

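For a fixed "reference" parameter vector $\mathbf{x}_0 \in \mathcal{X}$ and a prescribed bias function $\mathbf{c}(\cdot)$, we define the set of allowed estimators by

\[ \mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0) \triangleq \big\{ \hat{\mathbf{g}}(\cdot) \,\big|\, v(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0) < \infty , \; \mathbf{b}(\hat{\mathbf{g}}(\cdot);\mathbf{x}) = \mathbf{c}(\mathbf{x}) \;\; \forall \mathbf{x} \in \mathcal{X} \big\} . \]

We call a bias function $\mathbf{c}(\cdot)$ valid for the estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), \mathbf{g}(\cdot))$ at $\mathbf{x}_0 \in \mathcal{X}$ if the set $\mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0)$ is nonempty. This means that there is at least one estimator $\hat{\mathbf{g}}(\cdot)$ with finite variance at $\mathbf{x}_0$ and whose bias equals $\mathbf{c}(\cdot)$, i.e., $\mathbf{b}(\hat{\mathbf{g}}(\cdot);\mathbf{x}) = \mathbf{c}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. From (2), it follows that for a fixed bias $\mathbf{c}(\cdot)$, minimizing the MSE $\varepsilon(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0)$ is equivalent to minimizing the variance $v(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0)$. Therefore, in MVE, one attempts to find estimators that minimize the variance under the constraint of a prescribed bias function $\mathbf{c}(\cdot)$. Let

\[ M(\mathbf{c}(\cdot),\mathbf{x}_0) \triangleq \inf_{\hat{\mathbf{g}}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0)} v(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0) \tag{3} \]

denote the minimum (strictly speaking, infimum) variance at $\mathbf{x}_0$ for bias function $\mathbf{c}(\cdot)$. If $\mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0)$ is empty, i.e., if $\mathbf{c}(\cdot)$ is not valid, we set $M(\mathbf{c}(\cdot),\mathbf{x}_0) \triangleq \infty$. Any estimator $\hat{\mathbf{g}}^{(\mathbf{x}_0)}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0)$ that achieves the infimum in (3), i.e., for which $v(\hat{\mathbf{g}}^{(\mathbf{x}_0)}(\cdot);\mathbf{x}_0) = M(\mathbf{c}(\cdot),\mathbf{x}_0)$, is called a locally minimum variance (LMV) estimator at $\mathbf{x}_0$ for bias function $\mathbf{c}(\cdot)$ [?], [?], [13]. The corresponding minimum variance $M(\mathbf{c}(\cdot),\mathbf{x}_0)$ is called the minimum achievable variance at $\mathbf{x}_0$ for bias function $\mathbf{c}(\cdot)$. The minimization problem (3) is referred to as a minimum variance problem (MVP). By its definition in (3), $M(\mathbf{c}(\cdot),\mathbf{x}_0)$ is a lower bound on the variance at $\mathbf{x}_0$ of any estimator with bias function $\mathbf{c}(\cdot)$, i.e.,

\[ \hat{\mathbf{g}}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0) \;\; \Rightarrow \;\; v(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0) \ge M(\mathbf{c}(\cdot),\mathbf{x}_0) . \tag{4} \]

In fact, $M(\mathbf{c}(\cdot),\mathbf{x}_0)$ is the tightest lower bound, which is sometimes referred to as the Barankin bound.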

If, for a prescribed bias function $\mathbf{c}(\cdot)$, there exists an estimator that is the LMV estimator simultaneously at all $\mathbf{x}_0 \in \mathcal{X}$, then that estimator is called the uniformly minimum variance (UMV) estimator for bias function $\mathbf{c}(\cdot)$ [?], [?], [13]. For many estimation problems, a UMV estimator does not exist. However, it always exists if there exists a complete sufficient statistic [13, Theorem 1.11 and Corollary 1.12], [18, Theorem 6.2.25]. Under mild conditions, this includes the case where the statistical model corresponds to an exponential family.

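The variance to be minimized can be decomposed as

\[ v(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0) = \sum_{l \in [P]} v(\hat{g}_l(\cdot);\mathbf{x}_0) , \]

where $\hat{g}_l(\cdot) \triangleq (\hat{\mathbf{g}}(\cdot))_l$ and $v(\hat{g}_l(\cdot);\mathbf{x}_0) \triangleq \mathsf{E}_{\mathbf{x}_0}\big\{ [\hat{g}_l(\mathbf{y}) - \mathsf{E}_{\mathbf{x}_0}\{\hat{g}_l(\mathbf{y})\}]^2 \big\}$ for $l \in [P]$. Moreover, $\hat{\mathbf{g}}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot),\mathbf{x}_0)$ if and only if $\hat{g}_l(\cdot) \in \mathcal{A}(c_l(\cdot),\mathbf{x}_0)$ for all $l \in [P]$, where $c_l(\cdot) \triangleq (\mathbf{c}(\cdot))_l$. It follows that the minimization of $v(\hat{\mathbf{g}}(\cdot);\mathbf{x}_0)$ can be reduced to $P$ separate problems of minimizing the component variances $v(\hat{g}_l(\cdot);\mathbf{x}_0)$, each involving the optimization of a single scalar component $\hat{g}_l(\cdot)$ of $\hat{\mathbf{g}}(\cdot)$ subject to the scalar bias constraint $b(\hat{g}_l(\cdot);\mathbf{x}) = c_l(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. Therefore, without loss of generality, we will hereafter assume that the parameter function $g(\mathbf{x})$ is scalar-valued, i.e., $P = 1$.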

III. The RKHS Approach to MVE 

A powerful mathematical toolbox for MVE is provided by RKHS theory [?], [?], [19]. In this section, we review basic definitions and results of RKHS theory and its application to MVE, and we discuss a differentiability property that will be relevant to the variance bounds considered in Section IV.

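An RKHS is associated with a kernel function, which is a function $R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following two properties [19]:

• It is symmetric, i.e., $R(\mathbf{x}_1, \mathbf{x}_2) = R(\mathbf{x}_2, \mathbf{x}_1)$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$.

• For every finite set $\{\mathbf{x}_1, \ldots, \mathbf{x}_D\} \subseteq \mathcal{X}$, the matrix $\mathbf{R} \in \mathbb{R}^{D \times D}$ with entries $R_{m,n} = R(\mathbf{x}_m, \mathbf{x}_n)$ is positive semidefinite.

There exists an RKHS for any kernel function $R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ [19]. This RKHS, denoted $\mathcal{H}(R)$, is a Hilbert space equipped with an inner product $\langle \cdot\,, \cdot \rangle_{\mathcal{H}(R)}$ such that, for any $\mathbf{x} \in \mathcal{X}$,

• $R(\cdot\,,\mathbf{x}) \in \mathcal{H}(R)$ (here, $R(\cdot\,,\mathbf{x})$ denotes the function $f_{\mathbf{x}}(\mathbf{x}') = R(\mathbf{x}', \mathbf{x})$ with a fixed $\mathbf{x} \in \mathcal{X}$);

• for any function $f(\cdot) \in \mathcal{H}(R)$,

\[ \langle f(\cdot), R(\cdot\,,\mathbf{x}) \rangle_{\mathcal{H}(R)} = f(\mathbf{x}) . \tag{5} \]

Relation (5), which is known as the reproducing property, defines the inner product $\langle f, g \rangle_{\mathcal{H}(R)}$ for all $f(\cdot), g(\cdot) \in \mathcal{H}(R)$ because (in a certain sense) any $f(\cdot) \in \mathcal{H}(R)$ can be expanded into the set of functions $\{R(\cdot\,,\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$.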
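The reproducing property (5) is easiest to see on a finite parameter set, where the kernel is simply a symmetric positive semidefinite matrix. The following minimal sketch assumes a three-point set and a hypothetical kernel matrix `K`; functions in the RKHS are written as linear combinations of the kernel columns, and their inner product is the corresponding quadratic form.

```python
# Finite parameter set X = {x_0, x_1, x_2}; kernel values R(x_i, x_j) = K[i][j].
# K is a hypothetical symmetric positive definite matrix chosen for illustration.
K = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]

def function_from_coeffs(a):
    """Values f(x_i) of f(.) = sum_j a[j] * R(., x_j)."""
    return [sum(K[i][j] * a[j] for j in range(3)) for i in range(3)]

def inner(a, b):
    """Inner product <f, g> of f = sum_j a[j] R(., x_j) and g = sum_j b[j] R(., x_j)."""
    return sum(a[i] * K[i][j] * b[j] for i in range(3) for j in range(3))

a = [0.5, -1.0, 2.0]
f = function_from_coeffs(a)
for j in range(3):
    e_j = [1.0 if i == j else 0.0 for i in range(3)]  # coefficients of R(., x_j)
    # Reproducing property: <f, R(., x_j)> equals the function value f(x_j).
    print("f(x_%d) =" % j, f[j], "  <f, R(., x_%d)> =" % j, inner(a, e_j))
```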

A. The RKHS Associated with an MVP 

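Consider the class of MVPs that is defined by an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$, and all possible prescribed bias functions $c(\cdot) : \mathcal{X} \to \mathbb{R}$. With this class of MVPs, we can associate a kernel function $R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and, in turn, an RKHS $\mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$ [?]. (Note that, as our notation indicates, $R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot)$ and $\mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$ depend on $\mathcal{E}$ and $\mathbf{x}_0$ but not on $c(\cdot)$.) We assume that for all reference parameters $\mathbf{x}_0 \in \mathcal{X}$ for which the MVP (3) is considered,

\[ f(\mathbf{y};\mathbf{x}_0) \neq 0 , \quad \text{for all } \mathbf{y} \in \mathbb{R}^M . \tag{6} \]

We can then define the likelihood ratio as

\[ \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}) \triangleq \frac{f(\mathbf{y};\mathbf{x})}{f(\mathbf{y};\mathbf{x}_0)} . \tag{7} \]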

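We consider $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x})$ as a random variable (since it is a function of the random vector $\mathbf{y}$) that is parametrized by $\mathbf{x} \in \mathcal{X}$. Furthermore, we define the Hilbert space $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ as the closure of the linear span$^1$ of the set of random variables $\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$. The topology of $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ is determined by the inner product $\langle \cdot\,, \cdot \rangle_{\mathrm{RV}} : \mathcal{L}_{\mathcal{E},\mathbf{x}_0} \times \mathcal{L}_{\mathcal{E},\mathbf{x}_0} \to \mathbb{R}$ defined by

\[ \langle \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_1), \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_2) \rangle_{\mathrm{RV}} \triangleq \mathsf{E}_{\mathbf{x}_0}\big\{ \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_1) \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_2) \big\} = \mathsf{E}_{\mathbf{x}_0}\bigg\{ \frac{f(\mathbf{y};\mathbf{x}_1) f(\mathbf{y};\mathbf{x}_2)}{f^2(\mathbf{y};\mathbf{x}_0)} \bigg\} . \tag{8} \]

It can be shown that it is sufficient to define the inner product only for the random variables $\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ [?]. We will assume that $\langle \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_1), \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_2) \rangle_{\mathrm{RV}} < \infty$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$, or, equivalently,

\[ \mathsf{E}_{\mathbf{x}_0}\bigg\{ \frac{f(\mathbf{y};\mathbf{x}_1) f(\mathbf{y};\mathbf{x}_2)}{f^2(\mathbf{y};\mathbf{x}_0)} \bigg\} < \infty , \quad \text{for all } \mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X} . \tag{9} \]

A variant of this assumption was also used in [?], [?], [21], [22].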

$^1$A detailed discussion of the concepts of closure, inner product, orthonormal basis, and linear span in the context of abstract Hilbert space theory can be found in [2], [20].

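The inner product $\langle \cdot\,, \cdot \rangle_{\mathrm{RV}} : \mathcal{L}_{\mathcal{E},\mathbf{x}_0} \times \mathcal{L}_{\mathcal{E},\mathbf{x}_0} \to \mathbb{R}$ can now be interpreted as a kernel function $R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

\[ R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2) \triangleq \langle \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_1), \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_2) \rangle_{\mathrm{RV}} = \mathsf{E}_{\mathbf{x}_0}\bigg\{ \frac{f(\mathbf{y};\mathbf{x}_1) f(\mathbf{y};\mathbf{x}_2)}{f^2(\mathbf{y};\mathbf{x}_0)} \bigg\} . \tag{10} \]

The RKHS induced by $R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot)$ will be denoted by $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, i.e., $\mathcal{H}_{\mathcal{E},\mathbf{x}_0} \triangleq \mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$. This is the RKHS associated with the estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and the corresponding class of MVPs at $\mathbf{x}_0 \in \mathcal{X}$.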

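We note that assumption (6) implies that the likelihood ratio $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}) = \frac{f(\mathbf{y};\mathbf{x})}{f(\mathbf{y};\mathbf{x}_0)}$ is measurable with respect to the underlying dominating measure $\mu_{\mathcal{E}}$. Furthermore, the likelihood ratio $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x})$ is the Radon-Nikodym derivative [?], [14] of the probability measure $\mu^{\mathbf{y}}_{\mathbf{x}}$ induced by $f(\mathbf{y};\mathbf{x})$ with respect to the probability measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ induced by $f(\mathbf{y};\mathbf{x}_0)$ (cf. [?], [20], [23]). It is also important to observe that $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x})$ does not depend on the dominating measure $\mu_{\mathcal{E}}$ underlying the definition of the pdfs $f(\mathbf{y};\mathbf{x})$. Thus, the kernel $R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot)$ given by (10) does not depend on $\mu_{\mathcal{E}}$ either. Moreover, under assumption (6), we can always use the measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ as the base measure $\mu_{\mathcal{E}}$ for the estimation problem $\mathcal{E}$, since the Radon-Nikodym derivative $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x})$ is well defined. Note that, trivially, this also implies that the measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ dominates the measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$ [?, p. 443].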
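As a concrete illustration of the kernel (10), consider a scalar observation $y \sim \mathcal{N}(x, 1)$; this model is an assumption made here for illustration and is not taken from the text. A direct calculation gives $R_{\mathcal{E},x_0}(x_1, x_2) = \exp((x_1 - x_0)(x_2 - x_0))$ for this model. The sketch below compares that closed form with a trapezoidal approximation of the expectation in (10); the integration range and step count are ad hoc choices.

```python
import math

def gaussian_pdf(y, x):
    """pdf f(y; x) of a scalar observation y ~ N(x, 1)."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2.0 * math.pi)

def kernel_numeric(x1, x2, x0, half_width=12.0, steps=12000):
    """Trapezoidal approximation of E_{x0}{ f(y;x1) f(y;x2) / f(y;x0)^2 }."""
    lo = x0 - half_width
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        # The expectation weights the ratio by f(y;x0), so one factor cancels.
        total += weight * gaussian_pdf(y, x1) * gaussian_pdf(y, x2) / gaussian_pdf(y, x0)
    return total * h

def kernel_closed_form(x1, x2, x0):
    """Closed form for the assumed Gaussian location model."""
    return math.exp((x1 - x0) * (x2 - x0))

print(kernel_numeric(0.5, -0.3, 0.0), kernel_closed_form(0.5, -0.3, 0.0))
```

Assumption (9) holds for this model because the closed form is finite for all $x_1, x_2$.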

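The two Hilbert spaces $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ and $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ are closely related:

Theorem III.1 ([?]). Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and a fixed reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$. The Hilbert spaces $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ and $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ are isometric; a specific congruence (i.e., isometric mapping of functions in $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ to functions in $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$) $\mathsf{J}[\cdot] : \mathcal{H}_{\mathcal{E},\mathbf{x}_0} \to \mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ is given by

\[ \mathsf{J}[R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\mathbf{x})] = \rho_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\mathbf{x}) . \]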

B. RKHS-based Analysis of MVE 

An RKHS-based analysis of MVE is enabled by the following central result. 

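Theorem III.2 ([?], [?]). Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, a fixed reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$, and a prescribed bias function $c(\cdot) : \mathcal{X} \to \mathbb{R}$, corresponding to the prescribed mean function $\gamma(\cdot) \triangleq c(\cdot) + g(\cdot)$. Then, the following holds:

• The bias function $c(\cdot)$ is valid for $\mathcal{E}$ at $\mathbf{x}_0$ if and only if $\gamma(\cdot)$ belongs to the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$.

• If the bias function $c(\cdot)$ is valid, the corresponding minimum achievable variance at $\mathbf{x}_0$ is given by

\[ M(c(\cdot),\mathbf{x}_0) = \|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) , \tag{11} \]

and the LMV estimator at $\mathbf{x}_0$ is given by

\[ \hat{g}^{(\mathbf{x}_0)}(\cdot) = \mathsf{J}[\gamma(\cdot)] . \]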
This theorem shows that the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is equal to the set of the mean functions $\gamma(\mathbf{x}) = \mathsf{E}_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ of all estimators $\hat{g}(\cdot)$ with a finite variance at $\mathbf{x}_0$, i.e., $v(\hat{g}(\cdot);\mathbf{x}_0) < \infty$. Furthermore, the problem of solving the MVP (3) can be reduced to the computation of the squared norm $\|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ and the isometric image $\mathsf{J}[\gamma(\cdot)]$ of the prescribed mean function $\gamma(\cdot)$, viewed as an element of the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. This result is especially helpful if a simple characterization of $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is available. Here, following the terminology of [3], what is meant by "simple characterization" is the availability of an orthonormal basis (ONB) for $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ such that the inner products of $\gamma(\cdot)$ with the ONB functions can be computed easily.

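If such an ONB of $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ cannot be found, Theorem III.2 can still be used to derive lower bounds on the minimum achievable variance $M(c(\cdot),\mathbf{x}_0)$. Indeed, because of (11), any lower bound on $\|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ induces a lower bound on $M(c(\cdot),\mathbf{x}_0)$. A large class of lower bounds on $\|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ can be obtained via projections of $\gamma(\cdot)$ onto a subspace $\mathcal{U} \subseteq \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. Denoting the orthogonal projection of $\gamma(\cdot)$ onto $\mathcal{U}$ by $\gamma_{\mathcal{U}}(\cdot)$, we have $\|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} \le \|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ [20, Chapter 4] and thus, from (11),

\[ M(c(\cdot),\mathbf{x}_0) \ge \|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) , \tag{12} \]

for an arbitrary subspace $\mathcal{U} \subseteq \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. In particular, let us consider the special case of a finite-dimensional subspace $\mathcal{U} \subseteq \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ that is spanned by a given set of functions $u_l(\cdot) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, i.e.,

\[ \mathcal{U} = \mathrm{span}\{u_l(\cdot)\}_{l \in [L]} \triangleq \bigg\{ f(\cdot) = \sum_{l \in [L]} a_l u_l(\cdot) \,\bigg|\, a_l \in \mathbb{R} \bigg\} . \tag{13} \]

Here, $\|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ can be evaluated very easily due to the following expression [?, Theorem 3.1.8]:

\[ \|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} = \boldsymbol{\gamma}^T \mathbf{G}^{\dagger} \boldsymbol{\gamma} , \tag{14} \]

where the vector $\boldsymbol{\gamma} \in \mathbb{R}^L$ and the matrix $\mathbf{G} \in \mathbb{R}^{L \times L}$ are given elementwise by

\[ \gamma_l = \langle \gamma(\cdot), u_l(\cdot) \rangle_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} , \qquad G_{l,l'} = \langle u_l(\cdot), u_{l'}(\cdot) \rangle_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} . \tag{15} \]

If all $u_l(\cdot)$ are linearly independent, then a larger number $L$ of basis functions $u_l(\cdot)$ entails a higher dimension of $\mathcal{U}$ and, thus, a $\|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ that is at least as large; this implies that the lower bound (12) will be at least as high (i.e., at least as tight). In Section IV, we will show that some well-known lower bounds on the estimator variance are obtained from (12) and (14), using a subspace $\mathcal{U}$ of the form (13) and specific choices for the functions $u_l(\cdot)$ spanning $\mathcal{U}$.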
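The projection bound (12) with (14) and (15) can be sketched numerically. The example assumes the scalar model $y \sim \mathcal{N}(x, 1)$ with $x_0 = 0$, for which the kernel is $\exp(x_1 x_2)$ (an assumption for illustration, not taken from the text), and the subspace spanned by $u_l(\cdot) = R(\cdot\,, x_l)$ for a few hypothetical test points $x_l$. By the reproducing property (5), the entries in (15) are then $\gamma_l = \gamma(x_l)$ and $G_{l,l'} = R(x_l, x_{l'})$. Since the test points are distinct, $\mathbf{G}$ is invertible and the pseudoinverse in (14) reduces to the inverse, which is applied here by solving a linear system.

```python
import math

def kernel(x1, x2, x0=0.0):
    """Assumed kernel of the scalar Gaussian location model y ~ N(x, 1)."""
    return math.exp((x1 - x0) * (x2 - x0))

def solve(G, rhs):
    """Solve G z = rhs by Gaussian elimination with partial pivoting (G nonsingular)."""
    n = len(rhs)
    A = [row[:] + [rhs[i]] for i, row in enumerate(G)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (A[r][n] - sum(A[r][c] * z[c] for c in range(r + 1, n))) / A[r][r]
    return z

def projection_bound(gamma, points, x0=0.0):
    """Right-hand side of (12) for U = span{R(., x_l)}, x_l in points."""
    G = [[kernel(p, q, x0) for q in points] for p in points]
    g = [gamma(p) for p in points]  # <gamma, R(., x_l)> = gamma(x_l) by (5)
    z = solve(G, g)
    return sum(gi * zi for gi, zi in zip(g, z)) - gamma(x0) ** 2

def gamma(x):
    """Mean function of an unbiased estimator of x."""
    return x

b2 = projection_bound(gamma, [0.0, 0.5])
b3 = projection_bound(gamma, [0.0, 0.5, -0.5])
print("two test points:", b2, " three test points:", b3)
```

Enlarging the subspace cannot decrease the bound, and in this model the bound cannot exceed the variance 1 of the unbiased estimator $\hat{g}(y) = y$.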

C. Regular Estimation Problems and Differentiable RKHS

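Some of the lower bounds to be considered in Section IV require the estimation problem to satisfy certain regularity conditions.

Definition III.3. An estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ satisfying (9) is said to be regular up to order $m \in \mathbb{N}$ at an interior point $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ if the following holds:

• For every multi-index $\mathbf{p} \in \mathbb{Z}_+^N$ with entries $p_k \le m$, the partial derivatives $\frac{\partial^{\mathbf{p}} f(\mathbf{y};\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}$ exist and satisfy

\[ \mathsf{E}_{\mathbf{x}_0}\bigg\{ \bigg( \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}} f(\mathbf{y};\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \bigg)^{2} \bigg\} < \infty , \quad \text{for all } \mathbf{x} \in \mathcal{B}(\mathbf{x}_0, r) , \tag{16} \]

where $r > 0$ is a suitably chosen radius such that $\mathcal{B}(\mathbf{x}_0, r) \subseteq \mathcal{X}$.

• For any function $h(\cdot) : \mathbb{R}^M \to \mathbb{R}$ such that $\mathsf{E}_{\mathbf{x}}\{h(\mathbf{y})\}$ exists, the expectation operation commutes with partial differentiation in the sense that, for every multi-index $\mathbf{p} \in \mathbb{Z}_+^N$ with $p_k \le m$,

\[ \frac{\partial^{\mathbf{p}}}{\partial \mathbf{x}^{\mathbf{p}}} \int_{\mathbb{R}^M} h(\mathbf{y}) f(\mathbf{y};\mathbf{x}) \, d\mathbf{y} = \int_{\mathbb{R}^M} h(\mathbf{y}) \frac{\partial^{\mathbf{p}} f(\mathbf{y};\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \, d\mathbf{y} , \quad \text{for all } \mathbf{x} \in \mathcal{B}(\mathbf{x}_0, r) , \tag{17} \]

or equivalently

\[ \frac{\partial^{\mathbf{p}} \mathsf{E}_{\mathbf{x}}\{h(\mathbf{y})\}}{\partial \mathbf{x}^{\mathbf{p}}} = \mathsf{E}_{\mathbf{x}}\bigg\{ h(\mathbf{y}) \frac{1}{f(\mathbf{y};\mathbf{x})} \frac{\partial^{\mathbf{p}} f(\mathbf{y};\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \bigg\} , \quad \text{for all } \mathbf{x} \in \mathcal{B}(\mathbf{x}_0, r) , \tag{18} \]

provided that the right-hand side of (17) and (18) is finite.

• For every pair of multi-indices $\mathbf{p}_1, \mathbf{p}_2 \in \mathbb{Z}_+^N$ with $p_{1,k} \le m$ and $p_{2,k} \le m$, the expectation

\[ \mathsf{E}_{\mathbf{x}_0}\bigg\{ \frac{1}{f^2(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \bigg\} \tag{19} \]

depends continuously on the parameter vectors $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{B}(\mathbf{x}_0, r)$.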

We remark that the notion of a regular estimation problem according to Definition III.3 is somewhat similar to the notion of a regular statistical experiment introduced in [?, Section 1.7].

As we will show presently, the RKHS associated with a regular estimation problem has an important structural property, which we will term differentiable. This property has been previously considered, e.g., in [?].
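Definition III.4. A kernel $R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ over a domain $\mathcal{X} \subseteq \mathbb{R}^N$ is said to be differentiable up to order $m$ at an interior point $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ if, for any orders $\mathbf{p}_1, \mathbf{p}_2 \in \mathbb{Z}_+^N$ with $p_{1,k} \le m$ and $p_{2,k} \le m$, the partial derivatives $\frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2} R(\mathbf{x}_1, \mathbf{x}_2)}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}}$ exist and are continuous functions of the argument $(\mathbf{x}_1, \mathbf{x}_2)$ in a neighborhood of $(\mathbf{x}_0, \mathbf{x}_0)$ (here, $(\mathbf{x}_1, \mathbf{x}_2)$ and $(\mathbf{x}_0, \mathbf{x}_0)$ are viewed as vectors in $\mathbb{R}^{2N}$). An RKHS $\mathcal{H}(R)$ is said to be differentiable up to order $m$ at an interior point $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ if its kernel $R(\cdot\,,\cdot)$ is differentiable up to order $m$ at $\mathbf{x}_0$. Finally, a kernel $R(\cdot\,,\cdot)$ and the RKHS $\mathcal{H}(R)$ are said to be differentiable up to order $m$ if they are differentiable up to order $m$ at all $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$.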

The following relation has not been reported prior to our work in [8], to the best of our knowledge.

Theorem III.5 ([8, Theorem 4.4.3]). If an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ is regular up to order $m$ at $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$, then the kernel $R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and the associated RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ are differentiable up to order $m$ at $\mathbf{x}_0$ (in the sense of Definition III.4).

A proof is provided in Appendix A; this proof is essentially that in [8, Theorem 4.4.3] but reformulated in a more direct fashion.

It will be seen that, under certain conditions, the functions belonging to an RKHS $\mathcal{H}(R)$ that is differentiable at $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ are characterized completely by their partial derivatives at $\mathbf{x}_0$. This implies via Theorem III.2 together with the next theorem that, for a regular estimation problem, the mean function $\gamma(\mathbf{x}) = \mathsf{E}_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ of any estimator $\hat{g}(\cdot)$ with finite variance at $\mathbf{x}_0$ is completely specified by the partial derivatives $\big\{ \frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \big|_{\mathbf{x} = \mathbf{x}_0} \big\}_{\mathbf{p} \in \mathbb{Z}_+^N}$ (cf. Lemma [?] in Section [?]).

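Further important properties of a differentiable RKHS are stated in the following theorem.

Theorem III.6 ([25], [?]). Let $\mathcal{X} \subseteq \mathbb{R}^N$, and consider an RKHS $\mathcal{H}(R)$ with $R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is differentiable up to order $m$ at $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ (in the sense of Definition III.4). Then for any $\mathbf{p} \in \mathbb{Z}_+^N$ with $p_k \le m$, the following holds:

• The function $r^{(\mathbf{p})}_{\mathbf{x}_0}(\cdot) : \mathcal{X} \to \mathbb{R}$ defined by

\[ r^{(\mathbf{p})}_{\mathbf{x}_0}(\mathbf{x}) \triangleq \frac{\partial^{\mathbf{p}} R(\mathbf{x}, \mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}}} \bigg|_{\mathbf{x}_2 = \mathbf{x}_0} \tag{20} \]

is an element of $\mathcal{H}(R)$, i.e., $r^{(\mathbf{p})}_{\mathbf{x}_0}(\cdot) \in \mathcal{H}(R)$.

• For any function $f(\cdot) \in \mathcal{H}(R)$, the partial derivative $\frac{\partial^{\mathbf{p}} f(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \big|_{\mathbf{x} = \mathbf{x}_0}$ exists.

• The inner product of $r^{(\mathbf{p})}_{\mathbf{x}_0}(\cdot)$ with an arbitrary function $f(\cdot) \in \mathcal{H}(R)$ is given by

\[ \big\langle r^{(\mathbf{p})}_{\mathbf{x}_0}(\cdot), f(\cdot) \big\rangle_{\mathcal{H}(R)} = \frac{\partial^{\mathbf{p}} f(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \bigg|_{\mathbf{x} = \mathbf{x}_0} . \tag{21} \]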

Thus, an RKHS $\mathcal{H}(R)$ that is differentiable up to order $m$ at $\mathbf{x}_0$ contains the functions $\{r^{(\mathbf{p})}_{\mathbf{x}_0}(\mathbf{x})\}_{p_k \le m}$, and the inner products of any function $f(\cdot) \in \mathcal{H}(R)$ with the $r^{(\mathbf{p})}_{\mathbf{x}_0}(\mathbf{x})$ can be computed easily via differentiation of $f(\cdot)$. This makes the function sets $\{r^{(\mathbf{p})}_{\mathbf{x}_0}(\mathbf{x})\}$ appear as interesting candidates for a simple characterization of the RKHS $\mathcal{H}(R)$. However, in general, these function sets are not guaranteed to be complete or orthonormal, i.e., they do not constitute an ONB. An important exception is constituted by certain estimation problems $\mathcal{E}$ involving an exponential family of distributions, which will be studied in Section VI.

Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ that is regular up to order $m \in \mathbb{N}$ at $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$. According to Theorem III.2, the mean function $\gamma(\cdot)$ of any estimator with finite variance at $\mathbf{x}_0$ belongs to the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. By Theorem III.5, since $\mathcal{E}$ is assumed regular up to order $m$, $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is differentiable up to order $m$. This, in turn, implies via Theorem III.6 that the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$ exist up to order $m$. Therefore, for the derivation of lower bounds on the minimum achievable variance at $\mathbf{x}_0$ in the case of an estimation problem that is regular up to order $m$ at $\mathbf{x}_0$, we can always tacitly assume that the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$ exist up to order $m$; otherwise the corresponding bias function $c(\cdot) = \gamma(\cdot) - g(\cdot)$ cannot be valid, i.e., there would not exist any estimator with mean function $\gamma(\cdot)$ (or, equivalently, bias function $c(\cdot)$) and finite variance at $\mathbf{x}_0$.

IV. RKHS Formulation of Known Variance Bounds 

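Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and an estimator $\hat{g}(\cdot)$ with mean function $\gamma(\mathbf{x}) = \mathsf{E}_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ and bias function $c(\mathbf{x}) = \gamma(\mathbf{x}) - g(\mathbf{x})$. We assume that $\hat{g}(\cdot)$ has a finite variance at $\mathbf{x}_0$, which implies that the bias function $c(\cdot)$ is valid and $\hat{g}(\cdot)$ is an element of $\mathcal{A}(c(\cdot),\mathbf{x}_0)$, the set of allowed estimators at $\mathbf{x}_0$ for prescribed bias function $c(\cdot)$, which therefore is nonempty. Then, $\gamma(\cdot) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ according to Theorem III.2. We also recall from our discussion further above that if the estimation problem $\mathcal{E}$ is regular at $\mathbf{x}_0$ up to order $m$, then the partial derivatives $\frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \big|_{\mathbf{x} = \mathbf{x}_0}$ exist for all $\mathbf{p} \in \mathbb{Z}_+^N$ with $p_k \le m$.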

In this section, we will demonstrate how five known lower bounds on the variance (Barankin bound, Cramér-Rao bound, constrained Cramér-Rao bound, Bhattacharyya bound, and Hammersley-Chapman-Robbins bound) can be formulated in a unified manner within the RKHS framework. More specifically, by combining (4) with (12), it follows that the variance of $\hat{g}(\cdot)$ at $\mathbf{x}_0$ is lower bounded as

\[ v(\hat{g}(\cdot);\mathbf{x}_0) \ge \|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) , \tag{22} \]

where $\mathcal{U}$ is any subspace of $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. The five variance bounds to be considered are obtained via specific choices of $\mathcal{U}$.

A. Barankin Bound

For a (valid) prescribed bias function $c(\cdot)$, the Barankin bound [?], [?] is the minimum achievable variance at $\mathbf{x}_0$, i.e., the variance of the LMV estimator at $\mathbf{x}_0$, which we denoted $M(c(\cdot),\mathbf{x}_0)$. This is the tightest lower bound on the variance, cf. (4). Using the RKHS expression of $M(c(\cdot),\mathbf{x}_0)$ in (11), the Barankin bound can be written as

\[ v(\hat{g}(\cdot);\mathbf{x}_0) \ge M(c(\cdot),\mathbf{x}_0) = \|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) , \tag{23} \]

with $\gamma(\cdot) = c(\cdot) + g(\cdot)$, for any estimator $\hat{g}(\cdot)$ with bias function $c(\cdot)$. Comparing with (22), we see that the Barankin bound is obtained for the special choice $\mathcal{U} = \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, in which case $\gamma_{\mathcal{U}}(\cdot) = \gamma(\cdot)$ and (22) reduces to (23).
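In the literature [21], [27], the following special expression of the Barankin bound is usually considered. Let $\mathcal{D} \triangleq \{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$ be a subset of $\mathcal{X}$, with finite size $L = |\mathcal{D}| \in \mathbb{N}$ and elements $\mathbf{x}_l \in \mathcal{X}$, and let $\mathbf{a} \triangleq (a_1 \cdots a_L)^T$ with $a_l \in \mathbb{R}$. Then the Barankin bound can be written as [21, Theorem 4]

\[ v(\hat{g}(\cdot);\mathbf{x}_0) \ge M(c(\cdot),\mathbf{x}_0) = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}} \frac{\big( \sum_{l \in [L]} a_l [\gamma(\mathbf{x}_l) - \gamma(\mathbf{x}_0)] \big)^2}{\mathsf{E}_{\mathbf{x}_0}\big\{ \big( \sum_{l \in [L]} a_l \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_l) \big)^2 \big\}} , \tag{24} \]

where $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_l)$ is the likelihood ratio as defined in (7) and $\mathcal{A}_{\mathcal{D}}$ is defined as the set of all $\mathbf{a} \in \mathbb{R}^L$ for which the denominator $\mathsf{E}_{\mathbf{x}_0}\big\{ \big( \sum_{l \in [L]} a_l \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_l) \big)^2 \big\}$ does not vanish. Note that our notation $\sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}}$ is intended to indicate that the supremum is taken not only with respect to the elements $\mathbf{x}_l$ of $\mathcal{D}$ but also with respect to the size of $\mathcal{D}$ (number of elements), $L$. We will now verify that the bound in (24) can be obtained from our RKHS expression in (23). We will use the following result that we reported in [8, Theorem 3.1.2].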

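Lemma IV.1. Consider an RKHS $\mathcal{H}(R)$ with kernel $R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $\mathcal{D} \triangleq \{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$ with some $L = |\mathcal{D}| \in \mathbb{N}$ and $\mathbf{x}_l \in \mathcal{X}$, and let $\mathbf{a} \triangleq (a_1 \cdots a_L)^T$ with $a_l \in \mathbb{R}$. Then the norm $\|f(\cdot)\|_{\mathcal{H}(R)}$ of any function $f(\cdot) \in \mathcal{H}(R)$ can be expressed as

\[ \|f(\cdot)\|^2_{\mathcal{H}(R)} = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}'_{\mathcal{D}}} \frac{\big( \sum_{l \in [L]} a_l f(\mathbf{x}_l) \big)^2}{\sum_{l,l' \in [L]} a_l a_{l'} R(\mathbf{x}_l, \mathbf{x}_{l'})} , \tag{25} \]

where $\mathcal{A}'_{\mathcal{D}}$ is the set of all $\mathbf{a} \in \mathbb{R}^L$ for which $\sum_{l,l' \in [L]} a_l a_{l'} R(\mathbf{x}_l, \mathbf{x}_{l'})$ does not vanish.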

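We will furthermore use the fact, shown in [8, Section 2.3.5], that the minimum achievable variance at $\mathbf{x}_0$, $M(c(\cdot),\mathbf{x}_0)$ (i.e., the Barankin bound), remains unchanged when the prescribed mean function $\gamma(\mathbf{x})$ is replaced by $\tilde{\gamma}(\mathbf{x}) \triangleq \gamma(\mathbf{x}) + c$ with an arbitrary constant $c$. Setting in particular $c = -\gamma(\mathbf{x}_0)$, we have $\tilde{\gamma}(\mathbf{x}) = \gamma(\mathbf{x}) - \gamma(\mathbf{x}_0)$ and $\tilde{\gamma}(\mathbf{x}_0) = 0$, and thus (23) simplifies to

\[ v(\hat{g}(\cdot);\mathbf{x}_0) \ge M(c(\cdot),\mathbf{x}_0) = \|\tilde{\gamma}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} . \tag{26} \]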

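Using (25) in (26), we obtain

\[ M(c(\cdot),\mathbf{x}_0) = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}'_{\mathcal{D}}} \frac{\big( \sum_{l \in [L]} a_l \tilde{\gamma}(\mathbf{x}_l) \big)^2}{\sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_l, \mathbf{x}_{l'})} = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}'_{\mathcal{D}}} \frac{\big( \sum_{l \in [L]} a_l [\gamma(\mathbf{x}_l) - \gamma(\mathbf{x}_0)] \big)^2}{\sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_l, \mathbf{x}_{l'})} . \tag{27} \]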

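From (10) and (8), we have $R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2) = \mathrm{E}_{\mathbf{x}_0}\{ \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_1) \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_2) \}$, and thus the denominator in (27) becomes

$$\sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_l,\mathbf{x}_{l'}) \,=\, \mathrm{E}_{\mathbf{x}_0}\bigg\{ \sum_{l,l' \in [L]} a_l a_{l'} \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_l) \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_{l'}) \bigg\} \,=\, \mathrm{E}_{\mathbf{x}_0}\bigg\{ \bigg( \sum_{l \in [L]} a_l \, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y},\mathbf{x}_l) \bigg)^{\!2} \bigg\} ,$$

whence it also follows that $\mathcal{A}'_{\mathcal{D}} = \mathcal{A}_{\mathcal{D}}$. Therefore, (27) is equivalent to (24). Hence, we have shown that our RKHS expression (23) is equivalent to (24).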
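
As a hedged illustration of (27), the ratio inside the supremum can be evaluated for a fixed finite test set. The sketch below assumes a scalar Gaussian location model $y \sim \mathcal{N}(x, 1)$ and the unbiased case $\gamma(x) = x$; this model is not taken from the text above and is chosen only because its kernel has the simple form $R_{\mathcal{E},x_0}(x_1,x_2) = \exp((x_1 - x_0)(x_2 - x_0))$. The function names are hypothetical.

```python
import math

def gauss_kernel(x1, x2, x0):
    """Kernel R_{E,x0}(x1, x2) for the assumed model y ~ N(x, 1)."""
    return math.exp((x1 - x0) * (x2 - x0))

def barankin_ratio(points, coeffs, x0, gamma=lambda x: x):
    """Ratio in (27) for a fixed test set D = points and coefficients a = coeffs."""
    num = sum(a * (gamma(x) - gamma(x0)) for a, x in zip(coeffs, points)) ** 2
    den = sum(
        a1 * a2 * gauss_kernel(p1, p2, x0)
        for a1, p1 in zip(coeffs, points)
        for a2, p2 in zip(coeffs, points)
    )
    return num / den

x0 = 0.0
# Single test point x0 + h with h = 1: the ratio is h^2 / exp(h^2).
single = barankin_ratio([x0 + 1.0], [1.0], x0)
# Test points {x0 + h, x0} with a = (1, -1): the ratio is h^2 / (exp(h^2) - 1),
# which approaches 1 (the CRB of this assumed model) as h -> 0.
pair = barankin_ratio([x0 + 1e-3, x0], [1.0, -1.0], x0)
```

Each such ratio is only a lower bound on the supremum in (27); in this assumed example the second choice of test set comes much closer to it than the first.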



B. Cramer-Rao Bound 

The Cramer-Rao bound (CRB) [16], [28], [29] is the most popular lower variance bound. Since the CRB applies to any estimator with a prescribed bias function $c(\cdot)$, it yields also a lower bound on the minimum achievable variance $M(c(\cdot),\mathbf{x}_0)$ (cf. ©).

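Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ that is regular at $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ in the sense that [16, Theorem 3.2], 02] Theorem 5.10]

$$\mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{\partial \log f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg\} \,=\, 0 , \quad k \in [N] . \tag{28}$$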

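Let $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$, and consider an estimator $\hat{g}(\cdot)$ with mean function $\gamma(\mathbf{x}) = \mathrm{E}_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ and finite variance at $\mathbf{x}_0$ ($v(\hat{g}(\cdot);\mathbf{x}_0) < \infty$). Then, this variance is lower bounded by the CRB

$$v(\hat{g}(\cdot);\mathbf{x}_0) \,\ge\, \mathbf{b}^T(\mathbf{x}_0) \, \mathbf{J}^{\dagger}(\mathbf{x}_0) \, \mathbf{b}(\mathbf{x}_0) , \tag{29}$$

where $\mathbf{b}(\mathbf{x}_0) \triangleq \frac{\partial \gamma(\mathbf{x})}{\partial \mathbf{x}}\big|_{\mathbf{x}_0}$ and $\mathbf{J}(\mathbf{x}_0) \in \mathbb{R}^{N \times N}$, known as the Fisher information matrix associated with $\mathcal{E}$, is given elementwise by

$$\big( \mathbf{J}(\mathbf{x}_0) \big)_{k,l} \,\triangleq\, \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{\partial \log f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \frac{\partial \log f(\mathbf{y};\mathbf{x})}{\partial x_l} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg\} . \tag{30}$$

The regularity property (28) is closely related to our regularity property in Definition III.3. In fact, an estimation problem that is regular up to order $m = 1$ in the sense of Definition III.3 is also regular in the sense of (28). To show this implication, let us assume that $\mathcal{E}$ is regular up to order $m = 1$ at some $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ in the sense of Definition III.3. From (16) with $\mathbf{p} = \mathbf{e}_k$,

$$\mathrm{E}_{\mathbf{x}_0}\bigg\{ \bigg( \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg)^{\!2} \bigg\} \,<\, \infty , \quad \text{for all } \mathbf{x} \in \mathcal{B}(\mathbf{x}_0,r) , \tag{31}$$

where $r$ is sufficiently small such that $\mathcal{B}(\mathbf{x}_0,r) \subseteq \mathcal{X}$. Furthermore, from (18)$^2$ with $\mathbf{p} = \mathbf{e}_k$ and $h(\mathbf{y}) \equiv 1$,

$$\mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg\} \,=\, \frac{\partial \mathrm{E}_{\mathbf{x}}\{1\}}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} . \tag{32}$$

Now we obtain for the left-hand side in (28)

$$\mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{\partial \log f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg\} \,=\, \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg\} \,\stackrel{(32)}{=}\, \frac{\partial \mathrm{E}_{\mathbf{x}}\{1\}}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \,=\, 0 ,$$

where the last step holds because $\mathrm{E}_{\mathbf{x}}\{1\} = 1$ for all $\mathbf{x}$. Hence, the regularity property (28) is satisfied. Our assumption of regularity up to order $m = 1$ in the sense of Definition III.3 is somewhat stronger than (28). The reason why we use this (potentially) stronger regularity assumption is that it ensures that the RKHS associated with $\mathcal{E}$ is differentiable up to order 1, according to Theorem III.5. This differentiability is used in the proof of the following result [8, Section 4.4.2].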
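
As a numerical aside on (29) and (30), the following sketch evaluates the Fisher information for a scalar model by numerical integration and then forms the CRB. The Gaussian model $y \sim \mathcal{N}(x, \sigma^2)$, the integration grid, the finite-difference step, and all function names are assumptions chosen for illustration only.

```python
import math

def log_pdf(y, x, sigma):
    """log f(y; x) for the assumed model y ~ N(x, sigma^2)."""
    return -0.5 * ((y - x) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def fisher_information(x0, sigma, half_width=12.0, steps=20000, eps=1e-5):
    """Approximate J(x0) = E_{x0}{(d/dx log f(y; x))^2} with a midpoint rule."""
    lo = x0 - half_width * sigma
    dy = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * dy
        # Central difference for the score d/dx log f(y; x) at x = x0.
        score = (log_pdf(y, x0 + eps, sigma) - log_pdf(y, x0 - eps, sigma)) / (2.0 * eps)
        total += score ** 2 * math.exp(log_pdf(y, x0, sigma)) * dy
    return total

def crb(dgamma_dx, fisher):
    """Scalar version of (29): b^2 / J, using 0 as the pseudo-inverse of J = 0."""
    return dgamma_dx ** 2 / fisher if fisher > 0.0 else 0.0

sigma = 2.0
J = fisher_information(x0=0.5, sigma=sigma)  # close to 1 / sigma^2
bound = crb(dgamma_dx=1.0, fisher=J)         # close to sigma^2 for gamma(x) = x
```

For this assumed model the score is linear in $y$, so the numerical result can be compared directly with the known value $J = 1/\sigma^2$.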

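Theorem IV.2. Consider an estimation problem that is regular up to order 1 in the sense of Definition III.3. Then, for a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$, the CRB in (29) is obtained from (22) by using the subspace

$$\mathcal{U}_{\mathrm{CR}} \,\triangleq\, \mathrm{span}\big\{ \{v_0(\cdot)\} \cup \{v_l(\cdot)\}_{l \in [N]} \big\} ,$$

with the functions

$$v_0(\cdot) \,\triangleq\, R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \qquad v_l(\cdot) \,\triangleq\, \frac{\partial R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x})}{\partial x_l} \bigg|_{\mathbf{x}=\mathbf{x}_0} \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \quad l \in [N] .$$

C. Constrained Cramer-Rao Bound
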



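The constrained CRB [30]-[32] is an evolution of the CRB in (29) for estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ with a parameter set of the form

$$\mathcal{X} \,=\, \{ \mathbf{x} \in \mathbb{R}^N \,|\, \mathbf{f}(\mathbf{x}) = \mathbf{0} \} , \tag{33}$$

where $\mathbf{f}(\cdot) : \mathbb{R}^N \to \mathbb{R}^Q$ with $Q < N$ is a continuously differentiable function. We assume that the set $\mathcal{X}$ has a nonempty interior. Moreover, we require the Jacobian matrix $\mathbf{F}(\mathbf{x}) \triangleq \frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} \in \mathbb{R}^{Q \times N}$ to have rank $Q$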

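2 We can invoke (18) since the right-hand side of (18) is finite. Indeed, this right-hand side (with $\mathbf{p} = \mathbf{e}_k$ and $h(\mathbf{y}) \equiv 1$) satisfies

$$\mathrm{E}_{\mathbf{x}_0}\bigg\{ \bigg| \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg| \bigg\} \,\stackrel{(a)}{\le}\, \sqrt{ \mathrm{E}_{\mathbf{x}_0}\bigg\{ \bigg( \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial f(\mathbf{y};\mathbf{x})}{\partial x_k} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg)^{\!2} \bigg\} } \,\stackrel{(31)}{<}\, \infty .$$

In step (a), we used the Cauchy-Schwarz inequality for the Hilbert space $\mathcal{H}_{\mathbf{x}_0}$ consisting of real-valued measurable functions (or statistics) $t(\mathbf{y})$ with a finite stochastic power at $\mathbf{x}_0$ and equipped with the inner product $\langle t_1(\mathbf{y}), t_2(\mathbf{y}) \rangle_{\mathrm{RV}} = \mathrm{E}_{\mathbf{x}_0}\{ t_1(\mathbf{y}) \, t_2(\mathbf{y}) \}$, for any $t_1(\mathbf{y}), t_2(\mathbf{y}) \in \mathcal{H}_{\mathbf{x}_0}$ (cf. ©). More precisely, step (a) is obtained by setting $t_1(\mathbf{y}) = \big| \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial f(\mathbf{y};\mathbf{x})}{\partial x_k} \big|_{\mathbf{x}=\mathbf{x}_0} \big|$ and $t_2(\mathbf{y}) = 1$.

whenever $\mathbf{f}(\mathbf{x}) = \mathbf{0}$, i.e., for every $\mathbf{x} \in \mathcal{X}$. This full-rank requirement implies that the constraints represented by $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ are nonredundant [31]. Such parameter sets are considered, e.g., in [30]-[32]. Under these conditions, the implicit function theorem [32, Theorem 3.3], ifTTl Theorem 9.28] states that for any $\mathbf{x}_0 \in \mathcal{X}$, with $\mathcal{X}$ given by (33), there exists a continuously differentiable map $\mathbf{r}(\cdot)$ from an open set $\mathcal{O} \subseteq \mathbb{R}^{N-Q}$ into a set $\mathcal{P} \subseteq \mathcal{X}$ containing $\mathbf{x}_0$, i.e.,

$$\mathbf{r}(\cdot) : \mathcal{O} \subseteq \mathbb{R}^{N-Q} \to \mathcal{P} \subseteq \mathcal{X} , \quad \text{with } \mathbf{x}_0 \in \mathcal{P} . \tag{34}$$

The constrained CRB in the form presented in [31] reads

$$v(\hat{g}(\cdot);\mathbf{x}_0) \,\ge\, \mathbf{b}^T(\mathbf{x}_0) \, \mathbf{U}(\mathbf{x}_0) \big( \mathbf{U}^T(\mathbf{x}_0) \, \mathbf{J}(\mathbf{x}_0) \, \mathbf{U}(\mathbf{x}_0) \big)^{\dagger} \mathbf{U}^T(\mathbf{x}_0) \, \mathbf{b}(\mathbf{x}_0) , \tag{35}$$

where $\mathbf{b}(\mathbf{x}_0) = \frac{\partial \gamma(\mathbf{x})}{\partial \mathbf{x}}\big|_{\mathbf{x}_0}$, $\mathbf{J}(\mathbf{x}_0)$ is again the Fisher information matrix defined in (30), and $\mathbf{U}(\mathbf{x}_0) \in \mathbb{R}^{N \times (N-Q)}$ is any matrix whose columns form an ONB for the null space of the Jacobian matrix $\mathbf{F}(\mathbf{x}_0)$, i.e.,

$$\mathbf{F}(\mathbf{x}_0) \, \mathbf{U}(\mathbf{x}_0) = \mathbf{0} , \qquad \mathbf{U}^T(\mathbf{x}_0) \, \mathbf{U}(\mathbf{x}_0) = \mathbf{I}_{N-Q} .$$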
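
To illustrate how (35) is evaluated, the sketch below treats the simplest case $N - Q = 1$, where $\mathbf{U}(\mathbf{x}_0)$ reduces to a single unit-norm column. The linear constraint $x_1 + x_2 - 1 = 0$, the model $\mathbf{y} \sim \mathcal{N}(\mathbf{x}, \sigma^2 \mathbf{I})$, the choice $\gamma(\mathbf{x}) = x_1$, and the function name are assumptions made only for this example.

```python
import math

def constrained_crb_one_direction(b, J, u):
    """Evaluate (35) when N - Q = 1, so that U(x0) is a single unit-norm column u."""
    n = len(b)
    utb = sum(u[i] * b[i] for i in range(n))                               # U^T b
    utju = sum(u[i] * J[i][j] * u[j] for i in range(n) for j in range(n))  # U^T J U
    return utb ** 2 / utju if utju > 0.0 else 0.0

sigma2 = 4.0
J = [[1.0 / sigma2, 0.0], [0.0, 1.0 / sigma2]]     # Fisher information of y ~ N(x, sigma2 * I)
F = [1.0, 1.0]                                     # Jacobian of the assumed constraint x1 + x2 - 1 = 0
u = [1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0)]  # ONB of the null space of F
b = [1.0, 0.0]                                     # gradient of gamma(x) = x1

fu = sum(F[i] * u[i] for i in range(2))            # should vanish: F U = 0
ccrb = constrained_crb_one_direction(b, J, u)      # sigma2 / 2 in this example
crb = sum(bi ** 2 for bi in b) * sigma2            # unconstrained CRB b^T J^{-1} b = sigma2
```

In this assumed example the constraint halves the bound, since only the component of $\mathbf{b}(\mathbf{x}_0)$ along the feasible direction $\mathbf{u}$ contributes.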

The next result is proved in [8, Section 4.4.2]. 

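Theorem IV.3. Consider an estimation problem that is regular up to order 1 in the sense of Definition III.3. Then, for a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$, the constrained CRB in (35) is obtained from (22) by using the subspace

$$\mathcal{U}_{\mathrm{CCR}} \,\triangleq\, \mathrm{span}\big\{ \{v_0(\cdot)\} \cup \{v_l(\cdot)\}_{l \in [N-Q]} \big\} ,$$

with the functions

$$v_0(\cdot) \,\triangleq\, R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \qquad v_l(\cdot) \,\triangleq\, \frac{\partial R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{r}(\boldsymbol{\theta}))}{\partial \theta_l} \bigg|_{\boldsymbol{\theta} = \mathbf{r}^{-1}(\mathbf{x}_0)} \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \quad l \in [N-Q] ,$$

where $\mathbf{r}(\cdot)$ is any continuously differentiable function of the form (34).
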
D. Bhattacharya Bound 

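Whereas the CRB depends only on the first-order partial derivatives of $f(\mathbf{y};\mathbf{x})$ with respect to $\mathbf{x}$, the Bhattacharya bound [33], [34] involves also higher-order derivatives. For an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ that is regular at $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ up to order $m \in \mathbb{N}$, the Bhattacharya bound states that

$$v(\hat{g}(\cdot);\mathbf{x}_0) \,\ge\, \mathbf{a}^T(\mathbf{x}_0) \, \mathbf{B}^{\dagger}(\mathbf{x}_0) \, \mathbf{a}(\mathbf{x}_0) , \tag{36}$$

where the vector $\mathbf{a}(\mathbf{x}_0) \in \mathbb{R}^L$ and the matrix $\mathbf{B}(\mathbf{x}_0) \in \mathbb{R}^{L \times L}$ are given elementwise by $\big(\mathbf{a}(\mathbf{x}_0)\big)_l \triangleq \frac{\partial^{\mathbf{p}_l} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l}}\big|_{\mathbf{x}=\mathbf{x}_0}$ and

$$\big( \mathbf{B}(\mathbf{x}_0) \big)_{l,l'} \,\triangleq\, \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{1}{f^2(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_l} f(\mathbf{y};\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l}} \frac{\partial^{\mathbf{p}_{l'}} f(\mathbf{y};\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_{l'}}} \bigg|_{\mathbf{x}=\mathbf{x}_0} \bigg\} ,$$

respectively. Here, the $\mathbf{p}_l$, $l \in [L]$ are $L$ distinct multi-indices with $(\mathbf{p}_l)_k \le m$.
The following result is proved in [8, Section 4.4.3].

Theorem IV.4. Consider an estimation problem that is regular up to order $m$ in the sense of Definition III.3. Then, for a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$, the Bhattacharyya bound in (36) is obtained from (22) by using the subspace

$$\mathcal{U}_{\mathrm{B}} \,\triangleq\, \mathrm{span}\big\{ \{v_0(\cdot)\} \cup \{v_l(\cdot)\}_{l \in [L]} \big\} ,$$

with the functions

$$v_0(\cdot) \,\triangleq\, R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \qquad v_l(\cdot) \,\triangleq\, \frac{\partial^{\mathbf{p}_l} R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l}} \bigg|_{\mathbf{x}=\mathbf{x}_0} \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \quad l \in [L] . \tag{37}$$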



While the RKHS interpretation of the Bhattacharya bound has been presented previously in for a specific 
estimation problem, the above result holds for general estimation problems. We note that the bound tends to become higher (tighter) if $L$ is increased in the sense that additional functions $v_l(\cdot)$ are used (i.e., in addition to the functions already used). Finally, we note that the CRB subspace $\mathcal{U}_{\mathrm{CR}}$ in Theorem IV.2 is obtained as a special case of the Bhattacharya bound subspace $\mathcal{U}_{\mathrm{B}}$ by setting $L = N$, $m = 1$, and $\mathbf{p}_l = \mathbf{e}_l$ in (37).

E. Hammersley-Chapman-Robbins Bound

A drawback of the CRB and the Bhattacharya bound is that they exploit only the local structure of an estimation problem $\mathcal{E}$ around a specific point $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$ [33]. As an illustrative example, consider two different estimation problems $\mathcal{E}_1 = (\mathcal{X}_1, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and $\mathcal{E}_2 = (\mathcal{X}_2, f(\mathbf{y};\mathbf{x}), g(\cdot))$ with the same statistical model $f(\mathbf{y};\mathbf{x})$ and parameter function $g(\cdot)$ but different parameter sets $\mathcal{X}_1$ and $\mathcal{X}_2$. These parameter sets are assumed to be open balls centered at $\mathbf{x}_0$ with different radii $r_1$ and $r_2$, i.e., $\mathcal{X}_1 = \mathcal{B}(\mathbf{x}_0,r_1)$ and $\mathcal{X}_2 = \mathcal{B}(\mathbf{x}_0,r_2)$ with $r_1 \neq r_2$. Then the CRB at $\mathbf{x}_0$ for both estimation problems will be identical, irrespective of the values of $r_1$ and $r_2$, and similarly for the Bhattacharya bound. Thus, these bounds do not take into account a part of
the information contained in the parameter set X. The Barankin bound, on the other hand, exploits the full 
information carried by the parameter set X since it is the tightest possible lower bound on the estimator variance. 
However, the Barankin bound is difficult to evaluate in general. 

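The Hammersley-Chapman-Robbins bound (HCRB) [35]-[37] is a lower bound on the estimator variance that takes into account the global structure of the estimation problem associated with the entire parameter set $\mathcal{X}$. It can be evaluated much more easily than the Barankin bound, and it does not require the estimation problem to be regular. Based on a suitably chosen set of "test points" $\{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$, the HCRB states that [35]

$$v(\hat{g}(\cdot);\mathbf{x}_0) \,\ge\, \mathbf{m}^T(\mathbf{x}_0) \, \mathbf{V}^{\dagger}(\mathbf{x}_0) \, \mathbf{m}(\mathbf{x}_0) , \tag{38}$$

where the vector $\mathbf{m}(\mathbf{x}_0) \in \mathbb{R}^L$ and the matrix $\mathbf{V}(\mathbf{x}_0) \in \mathbb{R}^{L \times L}$ are given elementwise by $\big(\mathbf{m}(\mathbf{x}_0)\big)_l \triangleq \gamma(\mathbf{x}_l) - \gamma(\mathbf{x}_0)$ and

$$\big( \mathbf{V}(\mathbf{x}_0) \big)_{l,l'} \,\triangleq\, \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{[f(\mathbf{y};\mathbf{x}_l) - f(\mathbf{y};\mathbf{x}_0)] \, [f(\mathbf{y};\mathbf{x}_{l'}) - f(\mathbf{y};\mathbf{x}_0)]}{f^2(\mathbf{y};\mathbf{x}_0)} \bigg\} ,$$

respectively.

The following result is proved in [8, Section 4.4.4].

Theorem IV.5. The HCRB in (38), with test points $\{\mathbf{x}_l\}_{l \in [L]} \subseteq \mathcal{X}$, is obtained from (22) by using the subspace

$$\mathcal{U}_{\mathrm{HCR}} \,\triangleq\, \mathrm{span}\big\{ \{v_0(\cdot)\} \cup \{v_l(\cdot)\}_{l \in [L]} \big\} ,$$

with the functions

$$v_0(\cdot) \,\triangleq\, R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0} , \qquad v_l(\cdot) \,\triangleq\, R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x}_l) - R_{\mathcal{E},\mathbf{x}_0}(\cdot,\mathbf{x}_0) , \quad l \in [L] .$$

The HCRB tends to become higher (tighter) if $L$ is increased in the sense that test points $\mathbf{x}_l$ or, equivalently, functions $v_l(\cdot)$ are added to those already used.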
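
The effect of adding test points can be checked numerically. The following sketch evaluates the HCRB (38) for an assumed scalar model $y \sim \mathcal{N}(x, \sigma^2)$ with $\gamma(x) = x$ and test points $x_0 + h_l$; for this model (an assumption, not part of the text above) the entries of $\mathbf{V}(x_0)$ are $\exp(h_l h_{l'}/\sigma^2) - 1$. The function name and the restriction to at most two test points are choices made for brevity.

```python
import math

def hcrb_gauss(offsets, sigma2=1.0):
    """HCRB (38) for y ~ N(x, sigma2), gamma(x) = x, with test points x0 + h."""
    # (V)_{l,l'} = E_{x0}{[f_l - f_0][f_l' - f_0] / f_0^2} = exp(h_l h_l' / sigma2) - 1
    V = [[math.exp(h1 * h2 / sigma2) - 1.0 for h2 in offsets] for h1 in offsets]
    m = list(offsets)  # (m)_l = gamma(x_l) - gamma(x0) = h_l
    if len(m) == 1:
        return m[0] ** 2 / V[0][0]
    if len(m) == 2:
        det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
        # m^T V^{-1} m via the explicit 2x2 inverse (V is nonsingular for distinct offsets).
        return (V[1][1] * m[0] ** 2 - 2.0 * V[0][1] * m[0] * m[1] + V[0][0] * m[1] ** 2) / det
    raise ValueError("this sketch handles at most two test points")

one_point = hcrb_gauss([0.5])
two_points = hcrb_gauss([0.5, -0.5])
```

In this assumed example the second test point raises the bound, and both values stay below $\sigma^2 = 1$, which is the CRB of the model.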

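F. Lower Semi-continuity of the Barankin Bound

For a given estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and a prescribed bias function $c(\cdot)$, it is sometimes of interest to characterize not only the minimum achievable variance $M(c(\cdot),\mathbf{x}_0)$ at a single parameter vector $\mathbf{x}_0 \in \mathcal{X}$ but also how $M(c(\cdot),\mathbf{x}_0)$ changes if $\mathbf{x}_0$ is varied. The following result is proved in Appendix B.

Theorem IV.6. Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ with parameter set $\mathcal{X} \subseteq \mathbb{R}^N$ and a prescribed bias function $c(\cdot) : \mathcal{X} \to \mathbb{R}$ that is valid at all $\mathbf{x}_0 \in \mathcal{C}$ for some open set $\mathcal{C} \subseteq \mathcal{X}$ and for which the associated prescribed mean function $\gamma(\cdot) = c(\cdot) + g(\cdot)$ is a continuous function on $\mathcal{C}$. Furthermore assume that for any fixed $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$, $R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)$ is continuous with respect to $\mathbf{x}_0$ on $\mathcal{C}$, i.e.,

$$\lim_{\mathbf{x}'_0 \to \mathbf{x}_0} R_{\mathcal{E},\mathbf{x}'_0}(\mathbf{x}_1,\mathbf{x}_2) \,=\, R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2) , \quad \forall \mathbf{x}_0 \in \mathcal{C} , \;\; \forall \mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X} . \tag{39}$$

Then, the minimum achievable variance $M(c(\cdot),\mathbf{x})$, viewed as a function of $\mathbf{x}$, is lower semi-continuous on $\mathcal{C}$.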

A schematic illustration of a lower semi-continuous function is given in Fig. 1. The application of Theorem IV.6 to the estimation problems considered in [38], corresponding to the linear/Gaussian model with a sparse parameter vector, allows us to conclude that the "sparse CRB" introduced in [38] cannot be maximally tight, i.e., it is not equal to the minimum achievable variance. Indeed, the sparse CRB derived in [38] is in general a strictly upper semi-continuous$^3$ function of the parameter vector $\mathbf{x}$, whereas the minimum achievable variance $M(c(\cdot),\mathbf{x})$ is lower semi-continuous according to Theorem IV.6. Since a function cannot be simultaneously strictly upper semi-continuous and lower semi-continuous, the sparse CRB cannot be equal to $M(c(\cdot),\mathbf{x})$.
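
The last step of this argument can be spelled out with the standard definitions of semi-continuity. These definitions are textbook facts and are not quoted from the references above; $f$ and $x_0$ are generic symbols.

```latex
\begin{align*}
  &f \text{ lower semi-continuous at } x_0
    \;\Longleftrightarrow\; \liminf_{x \to x_0} f(x) \ge f(x_0), \\
  &f \text{ upper semi-continuous at } x_0
    \;\Longleftrightarrow\; \limsup_{x \to x_0} f(x) \le f(x_0). \\
  &\text{If both hold at } x_0: \quad
    f(x_0) \le \liminf_{x \to x_0} f(x) \le \limsup_{x \to x_0} f(x) \le f(x_0),
\end{align*}
% so lim_{x -> x_0} f(x) = f(x_0), i.e., f is continuous at x_0.
```

Hence a function that is both upper and lower semi-continuous everywhere is continuous, which contradicts strict upper semi-continuity in the sense of footnote 3.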

V. Sufficient Statistics 

For some estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, the observation $\mathbf{y} \in \mathbb{R}^M$ contains information that is irrelevant to $\mathcal{E}$, and thus $\mathbf{y}$ can be compressed in some sense. Accordingly, let us replace $\mathbf{y}$ by a transformed observation $\mathbf{z} = \mathbf{t}(\mathbf{y}) \in \mathbb{R}^K$, with a deterministic mapping $\mathbf{t}(\cdot) : \mathbb{R}^M \to \mathbb{R}^K$. A compression is achieved if $K < M$. Any transformed observation $\mathbf{z} = \mathbf{t}(\mathbf{y})$ is termed a statistic, and in particular it is said to be a sufficient statistic if it preserves all the information that is relevant to $\mathcal{E}$ [1], [13]-[16], [39]. In particular, a

3 A function is said to be strictly upper semi-continuous if it is upper semi-continuous but not continuous.

Fig. 1. Graph of a function $f(x)$ that is lower semi-continuous at $x_0$. The solid dot indicates the function value $f(x_0)$.



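sufficient statistic preserves the minimum achievable variance (Barankin bound) $M(c(\cdot),\mathbf{x}_0)$. In the following, the mapping $\mathbf{t}(\cdot)$ will be assumed to be measurable.

For a given reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$, we consider estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ for which there exists a dominating measure $\mu_{\mathcal{E}}$ such that the pdfs $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ are well defined with respect to $\mu_{\mathcal{E}}$ and condition (6) is satisfied. The Neyman-Fisher factorization theorem [13]-[16] then states that the statistic $\mathbf{z} = \mathbf{t}(\mathbf{y})$ is sufficient for $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ if and only if $f(\mathbf{y};\mathbf{x})$ can be factored as

$$f(\mathbf{y};\mathbf{x}) \,=\, h(\mathbf{t}(\mathbf{y});\mathbf{x}) \, k(\mathbf{y}) , \tag{40}$$

where $h(\cdot\,;\mathbf{x})$ and $k(\cdot)$ are nonnegative functions and the function $k(\cdot)$ does not depend on $\mathbf{x}$. The relation (40) has to be satisfied for every $\mathbf{y} \in \mathbb{R}^M$ except for a set of measure zero with respect to the dominating measure $\mu_{\mathcal{E}}$.

The probability measure on $\mathbb{R}^K$ (equipped with the system of $K$-dimensional Borel sets, cf. [1, Section 10]) that is induced by the random vector $\mathbf{z} = \mathbf{t}(\mathbf{y})$ is obtained as $\mu^{\mathbf{z}}_{\mathbf{x}} = \mu^{\mathbf{y}}_{\mathbf{x}} \mathbf{t}^{-1}$ [14], [15]. According to Section III-A, under condition (6), the measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ dominates the measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$. This, in turn, implies via [14, Lemma 4] that the measure $\mu^{\mathbf{z}}_{\mathbf{x}_0}$ dominates the measures $\{\mu^{\mathbf{z}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$, and therefore that, for each $\mathbf{x} \in \mathcal{X}$, there exists a pdf $f(\mathbf{z};\mathbf{x})$ with respect to the measure $\mu^{\mathbf{z}}_{\mathbf{x}_0}$. This pdf is given by the following result. (Note that we do not assume condition (9).)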

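Lemma V.1. Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ for which there exists a dominating measure $\mu_{\mathcal{E}}$ which is such that the Radon-Nikodym derivative of $\mu^{\mathbf{y}}_{\mathbf{x}}$ with respect to $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ is well defined and given by the likelihood ratio $\frac{f(\mathbf{y};\mathbf{x})}{f(\mathbf{y};\mathbf{x}_0)}$. Furthermore consider a sufficient statistic $\mathbf{z} = \mathbf{t}(\mathbf{y})$ for $\mathcal{E}$. Then, the pdf of $\mathbf{z}$ with respect to the dominating measure $\mu^{\mathbf{z}}_{\mathbf{x}_0}$ is given by

$$f(\mathbf{z};\mathbf{x}) \,=\, \frac{h(\mathbf{z};\mathbf{x})}{h(\mathbf{z};\mathbf{x}_0)} , \tag{41}$$

where the function $h(\mathbf{z};\mathbf{x})$ is obtained from the factorization (40).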

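Proof: The pdf $f(\mathbf{z};\mathbf{x})$ of $\mathbf{z}$ with respect to $\mu^{\mathbf{z}}_{\mathbf{x}_0}$ is defined by the relation

$$\mathrm{E}_{\mathbf{x}_0}\{ I_{\mathcal{A}}(\mathbf{z}) \, f(\mathbf{z};\mathbf{x}) \} \,=\, \mathrm{P}_{\mathbf{x}}\{ \mathbf{z} \in \mathcal{A} \} , \tag{42}$$

which has to be satisfied for every measurable set $\mathcal{A} \subseteq \mathbb{R}^K$ [1]. Denoting the pre-image of $\mathcal{A}$ under the mapping $\mathbf{t}(\cdot)$ by $\mathbf{t}^{-1}(\mathcal{A}) \triangleq \{ \mathbf{y} \,|\, \mathbf{t}(\mathbf{y}) \in \mathcal{A} \} \subseteq \mathbb{R}^M$, we have

$$\begin{aligned}
\mathrm{E}_{\mathbf{x}_0}\bigg\{ I_{\mathcal{A}}(\mathbf{z}) \, \frac{h(\mathbf{z};\mathbf{x})}{h(\mathbf{z};\mathbf{x}_0)} \bigg\}
&\stackrel{(a)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ I_{\mathbf{t}^{-1}(\mathcal{A})}(\mathbf{y}) \, \frac{h(\mathbf{t}(\mathbf{y});\mathbf{x})}{h(\mathbf{t}(\mathbf{y});\mathbf{x}_0)} \bigg\} \\
&\stackrel{(40)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ I_{\mathbf{t}^{-1}(\mathcal{A})}(\mathbf{y}) \, \frac{f(\mathbf{y};\mathbf{x})}{f(\mathbf{y};\mathbf{x}_0)} \bigg\} \\
&\stackrel{(b)}{=} \mathrm{P}_{\mathbf{x}}\{ \mathbf{y} \in \mathbf{t}^{-1}(\mathcal{A}) \} \\
&= \mathrm{P}_{\mathbf{x}}\{ \mathbf{z} \in \mathcal{A} \} ,
\end{aligned} \tag{43}$$

where step (a) follows from [1, Theorem 16.12] and (b) is due to the fact that the Radon-Nikodym derivative of $\mu^{\mathbf{y}}_{\mathbf{x}}$ with respect to $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ is given by $\frac{f(\mathbf{y};\mathbf{x})}{f(\mathbf{y};\mathbf{x}_0)}$, as explained in Section III-A. Comparing (43) with (42), we conclude that $\frac{h(\mathbf{z};\mathbf{x})}{h(\mathbf{z};\mathbf{x}_0)} = f(\mathbf{z};\mathbf{x})$ up to differences on a set of measure zero (with respect to $\mu^{\mathbf{z}}_{\mathbf{x}_0}$). Note that because we require $\mathbf{t}(\cdot)$ to be a measurable mapping, it is guaranteed that the set $\mathbf{t}^{-1}(\mathcal{A}) = \{ \mathbf{y} \,|\, \mathbf{t}(\mathbf{y}) \in \mathcal{A} \}$ is measurable for any measurable set $\mathcal{A} \subseteq \mathbb{R}^K$. □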

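Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ satisfying (9), so that the kernel $R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ exists according to (10). Let $\mathbf{z} = \mathbf{t}(\mathbf{y})$ be a sufficient statistic. We can then define the modified estimation problem $\mathcal{E}' \triangleq (\mathcal{X}, f(\mathbf{z};\mathbf{x}), g(\cdot))$, which is based on the observation $\mathbf{z}$ and whose statistical model is given by the pdf $f(\mathbf{z};\mathbf{x})$ (cf. (41)). The following theorem states that the RKHS associated with $\mathcal{E}'$ equals the RKHS associated with $\mathcal{E}$.

Theorem V.2. Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ satisfying (9) and a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$. Furthermore, for a sufficient statistic $\mathbf{z} = \mathbf{t}(\mathbf{y})$, consider the modified estimation problem $\mathcal{E}' \triangleq (\mathcal{X}, f(\mathbf{z};\mathbf{x}), g(\cdot))$ with dominating measure $\mu_{\mathcal{E}'} = \mu^{\mathbf{z}}_{\mathbf{x}_0}$. Then, $\mathcal{E}'$ satisfies (9) and furthermore $R_{\mathcal{E}',\mathbf{x}_0}(\cdot,\cdot) = R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ and $\mathcal{H}_{\mathcal{E}',\mathbf{x}_0} = \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$.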

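Proof: We have

$$\begin{aligned}
R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)
&\stackrel{(10)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{f(\mathbf{y};\mathbf{x}_1) \, f(\mathbf{y};\mathbf{x}_2)}{f^2(\mathbf{y};\mathbf{x}_0)} \bigg\} \\
&\stackrel{(40)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{h(\mathbf{t}(\mathbf{y});\mathbf{x}_1) \, h(\mathbf{t}(\mathbf{y});\mathbf{x}_2)}{h^2(\mathbf{t}(\mathbf{y});\mathbf{x}_0)} \bigg\} \\
&\stackrel{(a)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{h(\mathbf{z};\mathbf{x}_1) \, h(\mathbf{z};\mathbf{x}_2)}{h^2(\mathbf{z};\mathbf{x}_0)} \bigg\} \\
&\stackrel{(41)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{f(\mathbf{z};\mathbf{x}_1) \, f(\mathbf{z};\mathbf{x}_2)}{f^2(\mathbf{z};\mathbf{x}_0)} \bigg\} \\
&\stackrel{(10)}{=} R_{\mathcal{E}',\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2) ,
\end{aligned} \tag{44}$$

where, as before, step (a) follows from [1, Theorem 16.12]. From (44), we conclude that if $\mathcal{E}$ satisfies (9) then so does $\mathcal{E}'$. Moreover, from $R_{\mathcal{E}',\mathbf{x}_0}(\cdot,\cdot) = R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ in (44), it follows that $\mathcal{H}_{\mathcal{E}',\mathbf{x}_0} = \mathcal{H}(R_{\mathcal{E}',\mathbf{x}_0})$ equals $\mathcal{H}_{\mathcal{E},\mathbf{x}_0} = \mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$. □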

Intuitively, one might expect that the RKHS associated with a sufficient statistic should be typically "smaller" 
or "simpler" than the RKHS associated with the original observation, since in general the sufficient statistic 
is a compressed and "more concise" version of the observation. However, Theorem V.2 states that the RKHS
remains unchanged by this compression. One possible interpretation of this fact is that the RKHS description 
of an estimation problem is already "maximally efficient" in the sense that it cannot be reduced or simplified 
by using a compressed (yet sufficiently informative) observation. 
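
The invariance of the kernel under sufficient statistics can be checked on a small discrete example. The sketch below assumes $n$ i.i.d. Bernoulli observations with success probability as the parameter and the sufficient statistic $z = \sum_i y_i$ (binomial); this model, the use of probability mass functions in place of pdfs, and the function names are assumptions made only for illustration.

```python
import itertools
import math

def kernel_full(p1, p2, p0, n):
    """R_{E,x0}(x1, x2) from the full observation y in {0,1}^n (i.i.d. Bernoulli)."""
    def pmf(y, p):
        k = sum(y)
        return p ** k * (1.0 - p) ** (n - k)
    # E_{x0}{f(y;x1) f(y;x2) / f^2(y;x0)} = sum_y f(y;x1) f(y;x2) / f(y;x0)
    return sum(pmf(y, p1) * pmf(y, p2) / pmf(y, p0)
               for y in itertools.product((0, 1), repeat=n))

def kernel_sufficient(p1, p2, p0, n):
    """R_{E',x0}(x1, x2) from the sufficient statistic z = sum(y) (binomial)."""
    def pmf(z, p):
        return math.comb(n, z) * p ** z * (1.0 - p) ** (n - z)
    return sum(pmf(z, p1) * pmf(z, p2) / pmf(z, p0) for z in range(n + 1))

r_y = kernel_full(0.3, 0.6, 0.5, n=6)
r_z = kernel_sufficient(0.3, 0.6, 0.5, n=6)
```

Both sums agree up to rounding, although the first runs over $2^n$ outcomes and the second over only $n + 1$.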

VI. MVE for the Exponential Family 

An important class of estimation problems is defined by statistical models belonging to an exponential 
family. Such models are of considerable interest in the context of MVE because, under mild conditions, the 
existence of a UMV estimator is guaranteed. Furthermore, any estimation problem that admits the existence 
of an efficient estimator, i.e., an estimator whose variance achieves the CRB, must be necessarily based on an 
exponential family [13, Theorem 5.12]. In this section, we will characterize the RKHS for this class and use
it to derive lower variance bounds. 



A. Exponential Family 

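An exponential family is defined as the following parametrized set of pdfs $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ (with respect to the Lebesgue measure on $\mathbb{R}^M$) US, El, ED:

$$f(\mathbf{y};\mathbf{x}) \,=\, \exp\big( \boldsymbol{\phi}^T(\mathbf{y}) \, \mathbf{u}(\mathbf{x}) - A(\mathbf{x}) \big) \, h(\mathbf{y}) ,$$

with the sufficient statistic $\boldsymbol{\phi}(\cdot) : \mathbb{R}^M \to \mathbb{R}^P$, the parameter function $\mathbf{u}(\cdot) : \mathbb{R}^N \to \mathbb{R}^P$, the cumulant function $A(\cdot) : \mathbb{R}^N \to \mathbb{R}$, and the weight function $h(\cdot) : \mathbb{R}^M \to \mathbb{R}$. Many well-known statistical models are special instances of an exponential family fill. Without loss of generality, we can restrict ourselves to an exponential family in canonical form [13], for which $P = N$ and $\mathbf{u}(\mathbf{x}) = \mathbf{x}$, i.e.,

$$f^{(A)}(\mathbf{y};\mathbf{x}) \,=\, \exp\big( \boldsymbol{\phi}^T(\mathbf{y}) \, \mathbf{x} - A(\mathbf{x}) \big) \, h(\mathbf{y}) . \tag{45}$$

Here, the superscript $^{(A)}$ emphasizes the importance of the cumulant function $A(\cdot)$ in the characterization of an exponential family. In what follows, we assume that the parameter space is chosen as $\mathcal{X} \subseteq \mathcal{N}$, where $\mathcal{N} \subseteq \mathbb{R}^N$ is the natural parameter space defined as

$$\mathcal{N} \,\triangleq\, \bigg\{ \mathbf{x} \in \mathbb{R}^N \,\bigg|\, \int_{\mathbb{R}^M} \exp\big( \boldsymbol{\phi}^T(\mathbf{y}) \, \mathbf{x} \big) \, h(\mathbf{y}) \, d\mathbf{y} < \infty \bigg\} .$$

From the normalization constraint $\int_{\mathbb{R}^M} f^{(A)}(\mathbf{y};\mathbf{x}) \, d\mathbf{y} = 1$, it follows that the cumulant function $A(\cdot)$ is determined by the sufficient statistic $\boldsymbol{\phi}(\cdot)$ and the weight function $h(\cdot)$ as

$$A(\mathbf{x}) \,=\, \log\bigg( \int_{\mathbb{R}^M} \exp\big( \boldsymbol{\phi}^T(\mathbf{y}) \, \mathbf{x} \big) \, h(\mathbf{y}) \, d\mathbf{y} \bigg) , \quad \mathbf{x} \in \mathcal{N} .$$

The moment-generating function of $f^{(A)}(\mathbf{y};\mathbf{x})$ is defined as

$$\lambda(\mathbf{x}) \,\triangleq\, \exp(A(\mathbf{x})) \,=\, \int_{\mathbb{R}^M} \exp\big( \boldsymbol{\phi}^T(\mathbf{y}) \, \mathbf{x} \big) \, h(\mathbf{y}) \, d\mathbf{y} , \quad \mathbf{x} \in \mathcal{N} . \tag{46}$$

Note that

$$\mathcal{N} \,=\, \{ \mathbf{x} \in \mathbb{R}^N \,|\, \lambda(\mathbf{x}) < \infty \} . \tag{47}$$

Assuming a random vector $\mathbf{y} \sim f^{(A)}(\mathbf{y};\mathbf{x})$, it is known BTJ1 Theorem 2.2], BT1 Proposition 3.1] that for any $\mathbf{x} \in \mathcal{X}^{\mathrm{o}}$ and $\mathbf{p} \in \mathbb{Z}_+^N$, the moments $\mathrm{E}_{\mathbf{x}}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\}$ exist, i.e., $\mathrm{E}_{\mathbf{x}}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\} < \infty$, and they can be calculated from the partial derivatives of $\lambda(\mathbf{x})$ according to

$$\mathrm{E}_{\mathbf{x}}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\} \,=\, \frac{1}{\lambda(\mathbf{x})} \frac{\partial^{\mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} . \tag{48}$$

Thus, the partial derivatives $\frac{\partial^{\mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}$ exist for any $\mathbf{x} \in \mathcal{X}^{\mathrm{o}}$ and $\mathbf{p} \in \mathbb{Z}_+^N$, and for any choice of the sufficient statistic $\boldsymbol{\phi}(\cdot)$ and the weight function $h(\cdot)$. Moreover, they depend continuously on $\mathbf{x} \in \mathcal{X}^{\mathrm{o}}$ [40], BT1.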

B. RKHS Associated with an Exponential Family Based MVP 

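Consider an estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ with an exponential family statistical model $\{f^{(A)}(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ as defined in (45), and a fixed $\mathbf{x}_0 \in \mathcal{X}$. Consider further the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$. Its kernel is obtained as

$$\begin{aligned}
R_{\mathcal{E}^{(A)},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)
&\stackrel{(10)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{f^{(A)}(\mathbf{y};\mathbf{x}_1) \, f^{(A)}(\mathbf{y};\mathbf{x}_2)}{\big(f^{(A)}(\mathbf{y};\mathbf{x}_0)\big)^2} \bigg\} \qquad\qquad (49) \\
&\stackrel{(45)}{=} \mathrm{E}_{\mathbf{x}_0}\bigg\{ \frac{\exp\big(\boldsymbol{\phi}^T(\mathbf{y})\mathbf{x}_1 - A(\mathbf{x}_1)\big) \exp\big(\boldsymbol{\phi}^T(\mathbf{y})\mathbf{x}_2 - A(\mathbf{x}_2)\big)}{\exp\big(2[\boldsymbol{\phi}^T(\mathbf{y})\mathbf{x}_0 - A(\mathbf{x}_0)]\big)} \bigg\} \\
&= \mathrm{E}_{\mathbf{x}_0}\big\{ \exp\big( \boldsymbol{\phi}^T(\mathbf{y})(\mathbf{x}_1 + \mathbf{x}_2 - 2\mathbf{x}_0) - A(\mathbf{x}_1) - A(\mathbf{x}_2) + 2A(\mathbf{x}_0) \big) \big\} \\
&= \exp\big( -A(\mathbf{x}_1) - A(\mathbf{x}_2) + 2A(\mathbf{x}_0) \big) \int_{\mathbb{R}^M} \exp\big( \boldsymbol{\phi}^T(\mathbf{y})(\mathbf{x}_1 + \mathbf{x}_2 - 2\mathbf{x}_0) \big) \exp\big( \boldsymbol{\phi}^T(\mathbf{y})\mathbf{x}_0 - A(\mathbf{x}_0) \big) \, h(\mathbf{y}) \, d\mathbf{y} \\
&= \exp\big( -A(\mathbf{x}_1) - A(\mathbf{x}_2) + A(\mathbf{x}_0) \big) \int_{\mathbb{R}^M} \exp\big( \boldsymbol{\phi}^T(\mathbf{y})(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0) \big) \, h(\mathbf{y}) \, d\mathbf{y} \\
&\stackrel{(46)}{=} \frac{\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0) \, \lambda(\mathbf{x}_0)}{\lambda(\mathbf{x}_1) \, \lambda(\mathbf{x}_2)} . \qquad\qquad (50)
\end{aligned}$$

Because (49) and (50) are equal, we see that condition (9) is satisfied, i.e., $\mathrm{E}_{\mathbf{x}_0}\Big\{ \frac{f^{(A)}(\mathbf{y};\mathbf{x}_1) f^{(A)}(\mathbf{y};\mathbf{x}_2)}{(f^{(A)}(\mathbf{y};\mathbf{x}_0))^2} \Big\} < \infty$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$, if and only if $\frac{\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0) \lambda(\mathbf{x}_0)}{\lambda(\mathbf{x}_1) \lambda(\mathbf{x}_2)} < \infty$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$. Since $\mathbf{x}_0 \in \mathcal{X} \subseteq \mathcal{N}$, we have $\lambda(\mathbf{x}_0) < \infty$. Furthermore, $\lambda(\mathbf{x}) \neq 0$ for all $\mathbf{x} \in \mathcal{X}$. Therefore, (9) is satisfied if and only if $\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0) < \infty$. We conclude that for an estimation problem whose statistical model belongs to an exponential family, condition (9) is equivalent to

$$\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X} \;\Rightarrow\; \mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0 \in \mathcal{N} . \tag{51}$$

Furthermore, from (50) and the fact that the partial derivatives $\frac{\partial^{\mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}$ exist for any $\mathbf{x} \in \mathcal{X}^{\mathrm{o}}$ and $\mathbf{p} \in \mathbb{Z}_+^N$ and depend continuously on $\mathbf{x} \in \mathcal{X}^{\mathrm{o}}$, we can conclude that the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ is differentiable up to any order (cf. Definition III.4). We summarize this finding in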

Lemma VI.1. Consider an estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ associated with an exponential family (cf. (45)) with natural parameter space $\mathcal{N}$. The parameter set $\mathcal{X}$ is assumed to satisfy condition (51) for some reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$. Then, the kernel $R_{\mathcal{E}^{(A)},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)$ and the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ are differentiable up to any order $m$ (in the sense of Definition III.4).
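
The closed-form kernel (50) can be compared with a direct evaluation of (49) on a concrete family. The sketch below assumes the Poisson family in canonical form, $f(y;x) = \exp(yx - e^x)/y!$ for $y = 0, 1, 2, \ldots$, so that $A(x) = e^x$, $\lambda(x) = \exp(e^x)$, and $\mathcal{N} = \mathbb{R}$, which makes (51) hold automatically. This is a discrete analogue (sums in place of the integrals above) chosen only for illustration; the function names and the truncation length are assumptions.

```python
import math

def mgf(x):
    """lambda(x) = exp(A(x)) for the assumed Poisson family, where A(x) = exp(x)."""
    return math.exp(math.exp(x))

def kernel_closed_form(x1, x2, x0):
    """Kernel (50): lambda(x1 + x2 - x0) lambda(x0) / (lambda(x1) lambda(x2))."""
    return mgf(x1 + x2 - x0) * mgf(x0) / (mgf(x1) * mgf(x2))

def kernel_direct(x1, x2, x0, terms=200):
    """Kernel (49) by summing f(y;x1) f(y;x2) / f(y;x0) over y = 0, ..., terms - 1."""
    def log_pmf(y, x):
        # f(y; x) = exp(y x - exp(x)) / y!, i.e., phi(y) = y and h(y) = 1 / y!
        return y * x - math.exp(x) - math.lgamma(y + 1.0)
    return sum(math.exp(log_pmf(y, x1) + log_pmf(y, x2) - log_pmf(y, x0))
               for y in range(terms))

r_closed = kernel_closed_form(0.2, -0.4, 0.1)
r_direct = kernel_direct(0.2, -0.4, 0.1)
```

The closed form needs only four evaluations of $\lambda(\cdot)$, whereas the direct route requires a sum (or, for continuous models, an integral) over the observation space.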

Next, by combining Lemma VI.1 with Theorem III.6, we will derive simple lower bounds on the variance of estimators with a prescribed bias function.



C. Variance Bounds for the Exponential Family 

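If $\mathcal{X}^{\mathrm{o}}$ is nonempty, the sufficient statistic $\boldsymbol{\phi}(\cdot)$ is a complete sufficient statistic for the estimation problem $\mathcal{E}^{(A)}$, and thus there exists a UMV estimator $\hat{g}_{\mathrm{UMV}}(\cdot)$ for any valid bias function $c(\cdot)$ [13, p. 42]. This UMV estimator is given by the conditional expectation$^4$

$$\hat{g}_{\mathrm{UMV}}(\mathbf{y}) \,=\, \mathrm{E}_{\mathbf{x}}\{ \hat{g}_0(\mathbf{y}) \,|\, \boldsymbol{\phi}(\mathbf{y}) \} , \tag{52}$$

where $\hat{g}_0(\cdot)$ is any estimator with bias function $c(\cdot)$, i.e., $b(\hat{g}_0(\cdot);\mathbf{x}) = c(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. The minimum achievable variance $M(c(\cdot),\mathbf{x}_0)$ is then equal to the variance of $\hat{g}_{\mathrm{UMV}}(\cdot)$ at $\mathbf{x}_0$, i.e., $M(c(\cdot),\mathbf{x}_0) = v(\hat{g}_{\mathrm{UMV}}(\cdot);\mathbf{x}_0)$ [13, p. 89]. However, it may be difficult to actually construct the UMV estimator via (52) and to calculate its variance. In fact, it may be already a difficult task to find an estimator $\hat{g}_0(\cdot)$ whose bias function equals $c(\cdot)$. Therefore, it is still of interest to find simple closed-form lower bounds on the variance of any estimator with bias $c(\cdot)$.

Theorem VI.2. Consider an estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ with parameter set $\mathcal{X}$ satisfying (51) and a finite set of multi-indices $\{\mathbf{p}_l\}_{l \in [L]} \subseteq \mathbb{Z}_+^N$. Then, at any $\mathbf{x}_0 \in \mathcal{X}^{\mathrm{o}}$, the variance of any estimator $\hat{g}(\cdot)$ with mean function $\gamma(\mathbf{x}) = \mathrm{E}_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ and finite variance at $\mathbf{x}_0$ is lower bounded as

$$v(\hat{g}(\cdot);\mathbf{x}_0) \,\ge\, \mathbf{n}^T(\mathbf{x}_0) \, \mathbf{S}^{\dagger}(\mathbf{x}_0) \, \mathbf{n}(\mathbf{x}_0) - \gamma^2(\mathbf{x}_0) , \tag{53}$$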

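4 The conditional expectation in (52) can be taken with respect to the measure $\mu^{\mathbf{y}}_{\mathbf{x}}$ for an arbitrary $\mathbf{x} \in \mathcal{X}$. Indeed, since $\boldsymbol{\phi}(\cdot)$ is a sufficient statistic, $\mathrm{E}_{\mathbf{x}}\{ \hat{g}_0(\mathbf{y}) \,|\, \boldsymbol{\phi}(\mathbf{y}) \}$ yields the same result for every $\mathbf{x} \in \mathcal{X}$.

where the vector $\mathbf{n}(\mathbf{x}_0) \in \mathbb{R}^L$ and the matrix $\mathbf{S}(\mathbf{x}_0) \in \mathbb{R}^{L \times L}$ are given elementwise by

denotes the sum over all multi-indices $\mathbf{p} \in \mathbb{Z}_+^N$ such that $p_k \le (\mathbf{p}_l)_k$ for $k \in [N]$,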



A proof of this result is provided in Appendix C. This proof shows that the bound (53) is obtained by projecting an appropriately transformed version of the mean function $\gamma(\cdot)$ onto the finite-dimensional subspace $\mathcal{U} = \mathrm{span}\{ r^{(\mathbf{p}_l)}_{\mathbf{x}_0}(\cdot) \}_{l \in [L]}$ of an appropriately defined RKHS $\mathcal{H}(R)$, with the functions $r^{(\mathbf{p}_l)}_{\mathbf{x}_0}(\cdot)$ given by (20).

subspace tends to become higher-dimensional and in turn the lower bound (53) becomes higher, i.e., tighter.

The requirement of a finite variance $v(\hat{g}(\cdot);\mathbf{x}_0)$ in Theorem VI.2 implies via Theorem III.2 that $\gamma(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$. This, in turn, guarantees via Theorem III.6 (which can be invoked since, due to Lemma VI.1, the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ is differentiable up to any order at $\mathbf{x}_0$) the existence of the partial derivatives $\frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}\big|_{\mathbf{x}=\mathbf{x}_0}$. Note also that the bound (53) depends on the mean function $\gamma(\cdot)$ only via its local behavior as given by the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$ up to a suitable order.

Evaluating the bound (53) requires computation of the moments $\mathrm{E}_{\mathbf{x}_0}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\}$. This can be done by means of message passing algorithms fiTTl.

D. Reducing the Parameter Set 

Using the RKHS framework, we will now show that, under mild conditions, the minimum achievable variance $M(c(\cdot),\mathbf{x}_0)$ for an exponential family type estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ is invariant to reductions of the parameter set $\mathcal{X}$. Consider two estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and $\mathcal{E}' = (\mathcal{X}', f(\mathbf{y};\mathbf{x}), g(\cdot)|_{\mathcal{X}'})$ (for now, not necessarily of the exponential family type) that differ only in their parameter sets $\mathcal{X}$ and $\mathcal{X}'$. More specifically, $\mathcal{E}'$ is obtained from $\mathcal{E}$ by reducing the parameter set, i.e., $\mathcal{X}' \subseteq \mathcal{X}$. For these two estimation problems, we consider corresponding MVPs at a specific parameter vector $\mathbf{x}_0 \in \mathcal{X}'$ and for a certain prescribed bias $c(\cdot)$. More precisely, $c(\cdot)$ is the prescribed bias for $\mathcal{E}$ on the set $\mathcal{X}$, while the prescribed bias for $\mathcal{E}'$ is the restriction of $c(\cdot)$ to $\mathcal{X}'$, $c(\cdot)|_{\mathcal{X}'}$. We will denote the minimum achievable variances of the MVPs corresponding to $\mathcal{E}$ and $\mathcal{E}'$ by $M(c(\cdot),\mathbf{x}_0)$ and $M'(c(\cdot)|_{\mathcal{X}'},\mathbf{x}_0)$, respectively. From (24), it follows that $M'(c(\cdot)|_{\mathcal{X}'},\mathbf{x}_0) \le M(c(\cdot),\mathbf{x}_0)$, since taking the supremum over a reduced set can never result in an increase of the supremum.

The effect that a reduction of the parameter set X has on the minimum achievable variance can be analyzed 
conveniently within the RKHS framework. This is based on the following result. 




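Theorem VI.3 (033). Consider an RKHS $\mathcal{H}(R_1)$ of functions $f(\cdot) : \mathcal{D}_1 \to \mathbb{R}$, with kernel $R_1(\cdot,\cdot) : \mathcal{D}_1 \times \mathcal{D}_1 \to \mathbb{R}$. Let $\mathcal{D}_2 \subseteq \mathcal{D}_1$. Then, the set of functions $\{ \tilde{f}(\cdot) \triangleq f(\cdot)|_{\mathcal{D}_2} : f(\cdot) \in \mathcal{H}(R_1) \}$ that is obtained by restricting each function $f(\cdot) \in \mathcal{H}(R_1)$ to the subdomain $\mathcal{D}_2$ coincides with the RKHS $\mathcal{H}(R_2)$ whose kernel $R_2(\cdot,\cdot) : \mathcal{D}_2 \times \mathcal{D}_2 \to \mathbb{R}$ is the restriction of the kernel $R_1(\cdot,\cdot) : \mathcal{D}_1 \times \mathcal{D}_1 \to \mathbb{R}$ to the subdomain $\mathcal{D}_2 \times \mathcal{D}_2$, i.e., $R_2(\cdot,\cdot) \triangleq R_1(\cdot,\cdot)|_{\mathcal{D}_2 \times \mathcal{D}_2}$. Furthermore, the norm of an element $\tilde{f}(\cdot) \in \mathcal{H}(R_2)$ is equal to the minimum of the norms of all functions $f(\cdot) \in \mathcal{H}(R_1)$ that coincide with $\tilde{f}(\cdot)$ on $\mathcal{D}_2$, i.e.,

$$\|\tilde{f}(\cdot)\|_{\mathcal{H}(R_2)} \,=\, \min_{\substack{f(\cdot) \in \mathcal{H}(R_1) \\ f(\cdot)|_{\mathcal{D}_2} = \tilde{f}(\cdot)}} \|f(\cdot)\|_{\mathcal{H}(R_1)} . \tag{56}$$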

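Consider an arbitrary but fixed $f(\cdot) \in \mathcal{H}(R_1)$, and let $\tilde{f}(\cdot) \triangleq f(\cdot)|_{\mathcal{D}_2}$. Because $\tilde{f}(\cdot) \in \mathcal{H}(R_2)$, we can calculate $\|\tilde{f}(\cdot)\|_{\mathcal{H}(R_2)}$. From (56), we obtain for $\|\tilde{f}(\cdot)\|_{\mathcal{H}(R_2)} = \big\| f(\cdot)|_{\mathcal{D}_2} \big\|_{\mathcal{H}(R_2)}$ the inequality

$$\big\| f(\cdot)|_{\mathcal{D}_2} \big\|_{\mathcal{H}(R_2)} \,\le\, \|f(\cdot)\|_{\mathcal{H}(R_1)} . \tag{57}$$

This inequality holds for all $f(\cdot) \in \mathcal{H}(R_1)$.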
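
On a finite domain, (56) and (57) reduce to statements about kernel matrices and can be checked directly. The sketch below assumes a toy kernel $R_1(x,x') = \exp(x x')$ on the three-point domain $\mathcal{D}_1 = \{0, 0.5, 1\}$ and the subdomain $\mathcal{D}_2 = \{0, 0.5\}$; the kernel, the domain, the function values, and the helper names are all assumptions made for illustration. It uses the fact that, for a positive definite kernel matrix $\mathbf{K}$ on a finite domain, the squared RKHS norm of a function with value vector $\mathbf{f}$ is $\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}$.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rkhs_norm_sq(K, f):
    """Squared RKHS norm f^T K^{-1} f on a finite domain with kernel matrix K."""
    return sum(fi * ci for fi, ci in zip(f, solve(K, f)))

D1 = [0.0, 0.5, 1.0]
K1 = [[math.exp(x * xp) for xp in D1] for x in D1]  # assumed kernel on D1 x D1
K2 = [row[:2] for row in K1[:2]]                    # restriction to D2 x D2
f = [1.0, -2.0, 0.5]                                # an arbitrary function on D1

norm_full = rkhs_norm_sq(K1, f)
norm_restricted = rkhs_norm_sq(K2, f[:2])           # (57): never larger than norm_full

# Minimum-norm extension of f restricted to D2, cf. (56): the kernel interpolant at x = 1.0.
coeff = solve(K2, f[:2])
best_extension = f[:2] + [sum(K1[2][j] * coeff[j] for j in range(2))]
norm_best = rkhs_norm_sq(K1, best_extension)        # equals norm_restricted up to rounding
```

The arbitrary extension `f` has a much larger norm than its restriction, while the interpolating extension attains the minimum in (56).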

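Let us now return to the MVPs corresponding to $\mathcal{E}$ and $\mathcal{E}'$. From (57) with $\mathcal{D}_1 = \mathcal{X}$, $\mathcal{D}_2 = \mathcal{X}'$, $\mathcal{H}(R_1) = \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, and $\mathcal{H}(R_2) = \mathcal{H}_{\mathcal{E}',\mathbf{x}_0}$, we can conclude that, for any $\mathbf{x}_0 \in \mathcal{X}'$,

$$M'(c(\cdot)|_{\mathcal{X}'},\mathbf{x}_0) \,=\, \big\| \gamma(\cdot)|_{\mathcal{X}'} \big\|^2_{\mathcal{H}_{\mathcal{E}',\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) \,\stackrel{(57)}{\le}\, \|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) \,=\, M(c(\cdot),\mathbf{x}_0) . \tag{58}$$

Here, we also used the fact that $\gamma(\cdot)|_{\mathcal{X}'} = c(\cdot)|_{\mathcal{X}'} + g(\cdot)|_{\mathcal{X}'}$. The inequality in (58) means that a reduction of the parameter set $\mathcal{X}$ can never result in a deterioration of the achievable performance, i.e., in a higher minimum achievable variance. Besides this rather intuitive fact, Theorem VI.3 has the following consequence: Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ whose statistical model $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ satisfies (9) at some $\mathbf{x}_0 \in \mathcal{X}$ and moreover is contained in a "larger" model $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \tilde{\mathcal{X}}}$ with $\tilde{\mathcal{X}} \supseteq \mathcal{X}$. If the larger model $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \tilde{\mathcal{X}}}$ also satisfies (9), it follows from Theorem VI.3 that a prescribed bias function $c(\cdot) : \mathcal{X} \to \mathbb{R}$ can only be valid for $\mathcal{E}$ at $\mathbf{x}_0$ if it is the restriction of a function $c'(\cdot) : \tilde{\mathcal{X}} \to \mathbb{R}$ that is a valid bias function for the estimation problem $\tilde{\mathcal{E}} = (\tilde{\mathcal{X}}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ at $\mathbf{x}_0$. This holds true since every valid bias function for $\mathcal{E}$ at $\mathbf{x}_0$ is an element of the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, which by Theorem VI.3 consists precisely of the restrictions of the elements of the RKHS $\mathcal{H}_{\tilde{\mathcal{E}},\mathbf{x}_0}$, which by Theorem III.2 consists precisely of the mean functions that are valid for $\tilde{\mathcal{E}}$ at $\mathbf{x}_0$ (see the remark made immediately after Theorem III.2).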

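For the remainder of this section, we restrict our discussion to estimation problems $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ whose statistical model is an exponential family model. The next result characterizes the analytic properties of the mean functions $\gamma(\cdot)$ that belong to an RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$. A proof is provided in Appendix D.

Lemma VI.4. Consider an estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ with an open parameter set $\mathcal{X} \subseteq \mathcal{N}$ satisfying (51) for some $\mathbf{x}_0 \in \mathcal{X}$. Let $\gamma(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ be such that the partial derivatives $\frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}\big|_{\mathbf{x}=\mathbf{x}_0}$ vanish for every multi-index $\mathbf{p} \in \mathbb{Z}_+^N$. Then $\gamma(\mathbf{x}) = 0$ for all $\mathbf{x} \in \mathcal{X}$.

Note that since $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ is differentiable at $\mathbf{x}_0$ up to any order (see Lemma VI.1), it contains the function set $\{ r^{(\mathbf{p})}_{\mathbf{x}_0}(\mathbf{x}) \}_{\mathbf{p} \in \mathbb{Z}_+^N}$ defined in Theorem III.6. Moreover, by (21), for any $f(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ and any $\mathbf{p} \in \mathbb{Z}_+^N$, there is $\big\langle r^{(\mathbf{p})}_{\mathbf{x}_0}(\cdot), f(\cdot) \big\rangle_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} = \frac{\partial^{\mathbf{p}} f(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}\big|_{\mathbf{x}=\mathbf{x}_0}$. Hence, under the assumptions of Lemma VI.4, we have that if a function $f(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ satisfies $\big\langle r^{(\mathbf{p})}_{\mathbf{x}_0}(\cdot), f(\cdot) \big\rangle_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} = 0$ for all $\mathbf{p} \in \mathbb{Z}_+^N$, then $f(\cdot) \equiv 0$. Thus, in this case, the set $\{ r^{(\mathbf{p})}_{\mathbf{x}_0}(\mathbf{x}) \}_{\mathbf{p} \in \mathbb{Z}_+^N}$ is complete for the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$.

Upon combining Theorem VI.3 with Lemma VI.4, we arrive at the second main result of this section: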

Theorem VI.5. Consider an estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ with an open parameter set $\mathcal{X} \subseteq \mathcal{N}$ satisfying (51) for some $\mathbf{x}_0 \in \mathcal{X}$, and a prescribed bias function $c(\cdot)$ that is valid for $\mathcal{E}^{(A)}$ at $\mathbf{x}_0$. Furthermore consider a reduced parameter set $\mathcal{X}_1 \subseteq \mathcal{X}$ such that $\mathbf{x}_0 \in \mathcal{X}_1^{\mathrm{o}}$. Let $\mathcal{E}_1^{(A)} \triangleq (\mathcal{X}_1, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ denote the estimation problem that is obtained from $\mathcal{E}^{(A)}$ by reducing the parameter set to $\mathcal{X}_1$, and let $c_1(\cdot) \triangleq c(\cdot)|_{\mathcal{X}_1}$. Then, the minimum achievable variance for the restricted estimation problem $\mathcal{E}_1^{(A)}$ and the restricted bias function $c_1(\cdot)$, denoted by $M_1(c_1(\cdot),\mathbf{x}_0)$, is equal to the minimum achievable variance for the original estimation problem $\mathcal{E}^{(A)}$ and the original bias function $c(\cdot)$, i.e.,

$$M_1(c_1(\cdot),\mathbf{x}_0) \,=\, M(c(\cdot),\mathbf{x}_0) .$$

A proof of this theorem is provided in Appendix E. Note that the requirement $\mathbf{x}_0 \in \mathcal{X}_1^{\mathrm{o}}$ of the theorem implies that the reduced parameter set $\mathcal{X}_1$ must contain a neighborhood of $\mathbf{x}_0$, i.e., an open ball $\mathcal{B}(\mathbf{x}_0,r)$ with some radius $r > 0$. The main message of the theorem is that, for an estimation problem based on an exponential family, parameter set reductions have no effect on the minimum achievable variance at $\mathbf{x}_0$ as long as the reduced parameter set contains a neighborhood of $\mathbf{x}_0$.

VII. Conclusion 

The mathematical framework of reproducing kernel Hilbert spaces (RKHS) provides powerful tools for the 
analysis of minimum variance estimation (MVE) problems. Building upon the theoretical foundation developed 
in the seminal papers J2 and 0, we derived novel results concerning the RKHS-based analysis of lower 
variance bounds for MVE, of sufficient statistics, and of MVE problems conforming to an exponential family of 
distributions. More specifically, we presented an RKHS-based geometric interpretation of several well-known 
lower bounds on the estimator variance. We showed that each of these bounds is related to the orthogonal 
projection onto an associated subspace of the RKHS. In particular, the subspace associated with the Cramer- 
Rao bound is based on the strong structural properties of a differentiable RKHS. For a wide class of estimation 
problems, we proved that the minimum achievable variance, which is the tightest possible lower bound on the 
estimator variance (Barankin bound), is a lower semi-continuous function of the parameter vector. In some 
cases, this fact can be used to show that a given lower bound on the estimator variance is not maximally 
tight. Furthermore, we proved that the RKHS associated with an estimation problem remains unchanged if the 
observation is replaced by a sufficient statistic. 
Finally, we specialized the RKHS description to estimation problems whose observation conforms to an
exponential family of distributions. We showed that the kernel of the RKHS has a particularly simple expression 
in terms of the moment-generating function of the exponential family, and the RKHS itself is differentiable up 
to any order. Using this differentiability, we derived novel closed-form lower bounds on the estimator variance. 
We also showed that reducing the parameter set has no effect on the minimum achievable variance at a given 
reference parameter vector xo if the reduced parameter set contains a neighborhood of xo. 

Promising directions for future work include the practical implementation of message passing algorithms
for the efficient computation of the lower variance bounds for exponential families derived in Section VI-C.
Furthermore, in view of the close relations between exponential families and probabilistic graphical models [41],
it would be interesting to explore the relations between the graph-theoretic properties of the graph associated 
with an exponential family and the properties of the RKHS associated with that exponential family. 



Appendix A 
Proof of Theorem III.5

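We will use the Hilbert space $\mathcal{H}_{\mathbf{x}_0}$ generated by real-valued measurable functions (or statistics) $t(\mathbf{y})$ with
a finite stochastic power at $\mathbf{x}_0 \in \mathcal{X}^{\circ}$, i.e., $\mathcal{H}_{\mathbf{x}_0} = \{ t(\mathbf{y}) \,|\, \mathsf{E}_{\mathbf{x}_0}\{t^2(\mathbf{y})\} < \infty \}$. This Hilbert space is equipped with
the inner product $\langle t_1(\mathbf{y}), t_2(\mathbf{y}) \rangle_{\mathrm{RV}} = \mathsf{E}_{\mathbf{x}_0}\{ t_1(\mathbf{y})\, t_2(\mathbf{y}) \}$, for any $t_1(\mathbf{y}), t_2(\mathbf{y}) \in \mathcal{H}_{\mathbf{x}_0}$ (cf. ®).

We consider an estimation problem $\mathcal{E} = \big(\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot)\big)$ that is regular up to order $m$ at $\mathbf{x}_0 \in \mathcal{X}^{\circ}$. Let
$\mathcal{B}(\mathbf{x}_0,r) \subseteq \mathcal{X}$ (cf. Definition III.3). Then, for any two multi-indices $\mathbf{p}_1,\mathbf{p}_2 \in \mathbb{Z}_+^N$ with $p_{1,k} \le m$ and $p_{2,k} \le m$,
and for any two parameter vectors $\mathbf{x}_1,\mathbf{x}_2 \in \mathcal{B}(\mathbf{x}_0,r)$, we have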



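$$
\begin{aligned}
\infty &> \sqrt{ \mathsf{E}_{\mathbf{x}_0}\!\left\{ \left( \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \right)^{\!2} \right\} \, \mathsf{E}_{\mathbf{x}_0}\!\left\{ \left( \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \right)^{\!2} \right\} } \\
&\stackrel{(a)}{\ge} \mathsf{E}_{\mathbf{x}_0}\!\left\{ \left| \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \right| \left| \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \right| \right\} \\
&= \int_{\mathbb{R}^M} \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \left| \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \, \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \right| d\mathbf{y} \\
&\ge \left| \int_{\mathbb{R}^M} \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \, \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \, d\mathbf{y} \right| . 
\end{aligned}
\tag{59}
$$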



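Here, (a) is due to the Cauchy-Schwarz inequality in $\mathcal{H}_{\mathbf{x}_0}$ applied to $t_1(\mathbf{y}) = \Big| \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \Big|$ and $t_2(\mathbf{y}) = \Big| \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \Big|$; the first inequality in (59) holds because both expectations are finite for a regular estimation problem (cf. Definition III.3). By (59), we have that the integral $\int_{\mathbb{R}^M} \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \, d\mathbf{y}$ is finite. We can thus use it as the right-hand side of the relation in Definition III.3 that interchanges differentiation and integration, with $h(\mathbf{y}) = \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}}$, to obtain further

$$
\begin{aligned}
\int_{\mathbb{R}^M} \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \, \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \, d\mathbf{y}
&= \frac{\partial^{\mathbf{p}_2}}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \int_{\mathbb{R}^M} \frac{f(\mathbf{y};\mathbf{x}_2)}{f(\mathbf{y};\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \, d\mathbf{y} \\
&\stackrel{(b)}{=} \frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2}}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}} \int_{\mathbb{R}^M} \frac{f(\mathbf{y};\mathbf{x}_1) \, f(\mathbf{y};\mathbf{x}_2)}{f(\mathbf{y};\mathbf{x}_0)} \, d\mathbf{y} \\
&= \frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}} .
\end{aligned}
\tag{60}
$$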



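Here, (b) follows by another application of the same interchange relation, this time for $h(\mathbf{y}) = \frac{f(\mathbf{y};\mathbf{x}_2)}{f(\mathbf{y};\mathbf{x}_0)}$. Hence, upon combining
(60) with (59), we conclude that $\Big| \frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}} \Big| < \infty$. We have thus shown that the partial derivatives
$\frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1,\mathbf{x}_2)}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}}$ exist for all $\mathbf{p}_1,\mathbf{p}_2 \in \mathbb{Z}_+^N$ with $p_{1,k} \le m$ and $p_{2,k} \le m$ and for all $\mathbf{x}_1,\mathbf{x}_2 \in \mathcal{B}(\mathbf{x}_0,r)$.
Moreover, these partial derivatives are continuous functions of $\mathbf{x}_1$ and $\mathbf{x}_2$, because (due to (60)) they are given
by the expression

$$
\int_{\mathbb{R}^M} \frac{1}{f(\mathbf{y};\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \, \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \, d\mathbf{y}
= \mathsf{E}_{\mathbf{x}_0}\!\left\{ \frac{1}{f^2(\mathbf{y};\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_1} f(\mathbf{y};\mathbf{x}_1)}{\partial \mathbf{x}_1^{\mathbf{p}_1}} \, \frac{\partial^{\mathbf{p}_2} f(\mathbf{y};\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_2}} \right\} ,
$$

which varies continuously with $\mathbf{x}_1$ and $\mathbf{x}_2$ as assumed in Definition III.3 (see (19)). We conclude that the kernel
$R_{\mathcal{E},\mathbf{x}_0}(\cdot\,,\cdot)$ is differentiable up to order $m$.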
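
The identity (60) can be checked numerically in a simple special case. The following is a minimal sketch, not part of the original proof: it assumes a scalar unit-variance Gaussian location family $f(y;x) = \mathcal{N}(y;x,1)$, for which the kernel has the closed form $R_{\mathcal{E},x_0}(x_1,x_2) = \exp\big((x_1-x_0)(x_2-x_0)\big)$, and it takes first-order derivatives ($p_1 = p_2 = 1$). All function names are hypothetical.

```python
from math import exp, pi, sqrt

def pdf(y, x):
    # Assumed model: scalar Gaussian with mean x and unit variance.
    return exp(-0.5 * (y - x) ** 2) / sqrt(2.0 * pi)

def dpdf_dx(y, x):
    # First derivative of the pdf with respect to the parameter x.
    return (y - x) * pdf(y, x)

def mixed_derivative_closed_form(x1, x2, x0):
    # d^2/(dx1 dx2) of exp((x1 - x0) * (x2 - x0)), the kernel of this model.
    u, v = x1 - x0, x2 - x0
    return (1.0 + u * v) * exp(u * v)

def mixed_derivative_integral(x1, x2, x0, half_width=10.0, steps=20000):
    # Left-hand side of (60): integral of (1/f(y;x0)) * df(y;x1)/dx1 * df(y;x2)/dx2,
    # approximated by a Riemann sum on [x0 - half_width, x0 + half_width].
    dy = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        y = x0 - half_width + k * dy
        total += dpdf_dx(y, x1) * dpdf_dx(y, x2) / pdf(y, x0)
    return total * dy

lhs = mixed_derivative_integral(0.3, -0.2, 0.0)
rhs = mixed_derivative_closed_form(0.3, -0.2, 0.0)
print(lhs, rhs)
```

Under these assumptions the two printed values should agree up to the discretization error of the Riemann sum, which illustrates that the mixed kernel derivative is given by the integral expression used in the continuity argument.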

Appendix B 
Proof of Theorem IV.6

We first note that our assumption that the prescribed bias function $c(\cdot)$ is valid for $\mathcal{E}$ at every $\mathbf{x} \in \mathcal{C}$ has
two consequences. First, $M(c(\cdot),\mathbf{x}) < \infty$ for every $\mathbf{x} \in \mathcal{C}$ (cf. our definition of the validity of a bias function
in Section II); second, as stated by Theorem III.2, the prescribed mean function $\gamma(\cdot) = c(\cdot) + g(\cdot)$ belongs to
$\mathcal{H}_{\mathcal{E},\mathbf{x}}$ for every $\mathbf{x} \in \mathcal{C}$.

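Following [2], we define the linear span of a kernel function $R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, denoted by $\mathcal{L}(R)$, as
the set of all functions $f(\cdot) : \mathcal{X} \to \mathbb{R}$ that are finite linear combinations of the form

$$ f(\cdot) = \sum_{l \in [L]} a_l \, R(\cdot\,,\mathbf{x}_l) , \quad \text{with } \mathbf{x}_l \in \mathcal{X}, \; a_l \in \mathbb{R}, \; L \in \mathbb{N} . \tag{61} $$

The linear span $\mathcal{L}(R)$ can be used to express the norm of any function $h(\cdot) \in \mathcal{H}(R)$ according to

$$ \| h(\cdot) \|^2_{\mathcal{H}(R)} = \sup_{\substack{f(\cdot) \in \mathcal{L}(R) \\ \|f(\cdot)\|^2_{\mathcal{H}(R)} > 0}} \frac{ \langle h(\cdot), f(\cdot) \rangle^2_{\mathcal{H}(R)} }{ \| f(\cdot) \|^2_{\mathcal{H}(R)} } . \tag{62} $$

This expression can be shown by combining [8, Theorem 3.1.2] and [8, Theorem 3.2.2]. We can now develop
the minimum achievable variance $M(c(\cdot),\mathbf{x})$ as follows:

$$
\begin{aligned}
M(c(\cdot),\mathbf{x}) &= \| \gamma(\cdot) \|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}}} - \gamma^2(\mathbf{x}) \\
&\stackrel{(62)}{=} \sup_{\substack{f(\cdot) \in \mathcal{L}(R_{\mathcal{E},\mathbf{x}}) \\ \|f(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}}} > 0}} \frac{ \langle \gamma(\cdot), f(\cdot) \rangle^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}}} }{ \| f(\cdot) \|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}}} } - \gamma^2(\mathbf{x}) .
\end{aligned}
$$

Using (61) and letting $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_L\}$, $\mathbf{a} = (a_1 \cdots a_L)^T$, and $\mathcal{A}_{\mathcal{D}} = \big\{ \mathbf{a} \in \mathbb{R}^L \,\big|\, \sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}}(\mathbf{x}_l,\mathbf{x}_{l'}) > 0 \big\}$, we obtain further

$$ M(c(\cdot),\mathbf{x}) = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},\mathbf{a}}(\mathbf{x}) . \tag{63} $$

Here, our notation $\sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}}$ indicates that the supremum is taken not only with respect to the elements
$\mathbf{x}_l$ of $\mathcal{D}$ but also with respect to the size of $\mathcal{D}$, $L = |\mathcal{D}|$, and the function $h_{\mathcal{D},\mathbf{a}}(\cdot) : \mathcal{X} \to \mathbb{R}$ is given by

$$
\begin{aligned}
h_{\mathcal{D},\mathbf{a}}(\mathbf{x}) &= \frac{ \big\langle \gamma(\cdot), \sum_{l \in [L]} a_l R_{\mathcal{E},\mathbf{x}}(\cdot\,,\mathbf{x}_l) \big\rangle^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}}} }{ \big\| \sum_{l \in [L]} a_l R_{\mathcal{E},\mathbf{x}}(\cdot\,,\mathbf{x}_l) \big\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}}} } - \gamma^2(\mathbf{x}) \\
&= \frac{ \big( \sum_{l \in [L]} a_l \gamma(\mathbf{x}_l) \big)^2 }{ \sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}}(\mathbf{x}_l,\mathbf{x}_{l'}) } - \gamma^2(\mathbf{x}) ,
\end{aligned}
$$

where the second expression follows from the reproducing property of the kernel.

For any finite set $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$ and any $\mathbf{a} \in \mathcal{A}_{\mathcal{D}}$, it follows from our assumptions of continuity of
$R_{\mathcal{E},\mathbf{x}}(\cdot\,,\cdot)$ with respect to $\mathbf{x}$ on $\mathcal{C}$ (see (39)) and continuity of $\gamma(\mathbf{x})$ on $\mathcal{C}$ that the function $h_{\mathcal{D},\mathbf{a}}(\mathbf{x})$ is continuous
in a neighborhood around any point $\mathbf{x}_0 \in \mathcal{C}$. Thus, for any $\mathbf{x}_0 \in \mathcal{C}$, there exists a radius $\delta_0 > 0$ such that $h_{\mathcal{D},\mathbf{a}}(\mathbf{x})$
is continuous on $\mathcal{B}(\mathbf{x}_0,\delta_0) \subseteq \mathcal{C}$.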

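We will now show that the function $M(c(\cdot),\mathbf{x})$ given by (63) is lower semi-continuous at every $\mathbf{x}_0 \in \mathcal{C}$,
i.e., for any $\mathbf{x}_0 \in \mathcal{C}$ and $\varepsilon > 0$, we can find a radius $r > 0$ such that

$$ M(c(\cdot),\mathbf{x}) \ge M(c(\cdot),\mathbf{x}_0) - \varepsilon , \quad \text{for all } \mathbf{x} \in \mathcal{B}(\mathbf{x}_0,r) . \tag{64} $$

Due to (63), there must be a finite subset $\mathcal{D}_0 \subseteq \mathcal{X}$ and a vector $\mathbf{a}_0 \in \mathcal{A}_{\mathcal{D}_0}$ such that$^5$

$$ h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x}_0) \ge M(c(\cdot),\mathbf{x}_0) - \frac{\varepsilon}{2} , \tag{65} $$

for any given $\varepsilon > 0$. Furthermore, since $h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x})$ is continuous on $\mathcal{B}(\mathbf{x}_0,\delta_0)$ as shown above, there is a radius
$r_0 > 0$ (with $r_0 < \delta_0$) such that

$$ h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x}) \ge h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x}_0) - \frac{\varepsilon}{2} , \quad \text{for all } \mathbf{x} \in \mathcal{B}(\mathbf{x}_0,r_0) . \tag{66} $$

By combining this inequality with (65), it follows that there is a radius $r > 0$ (with $r < \delta_0$) such that for any
$\mathbf{x} \in \mathcal{B}(\mathbf{x}_0,r)$ we have

$$ h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x}) \ge h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x}_0) - \frac{\varepsilon}{2} \ge M(c(\cdot),\mathbf{x}_0) - \varepsilon , \tag{67} $$

and further

$$ M(c(\cdot),\mathbf{x}) \stackrel{(63)}{=} \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},\mathbf{a}}(\mathbf{x}) \ge h_{\mathcal{D}_0,\mathbf{a}_0}(\mathbf{x}) \stackrel{(67)}{\ge} M(c(\cdot),\mathbf{x}_0) - \varepsilon . $$

Thus, for any given $\varepsilon > 0$, there is a radius $r > 0$ (with $r < \delta_0$) such that $M(c(\cdot),\mathbf{x}) \ge M(c(\cdot),\mathbf{x}_0) - \varepsilon$ for all
$\mathbf{x} \in \mathcal{B}(\mathbf{x}_0,r)$, i.e., (64) has been proved.

$^5$ Indeed, if (65) were not true, we would have $h_{\mathcal{D},\mathbf{a}}(\mathbf{x}_0) < M(c(\cdot),\mathbf{x}_0) - \varepsilon/2$ for every choice of $\mathcal{D}$ and $\mathbf{a}$. This, in turn,
would imply that $\sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},\mathbf{a}}(\mathbf{x}_0) \le M(c(\cdot),\mathbf{x}_0) - \varepsilon/2 < M(c(\cdot),\mathbf{x}_0)$, yielding the contradiction $M(c(\cdot),\mathbf{x}_0) \stackrel{(63)}{=} \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},\mathbf{a}}(\mathbf{x}_0) < M(c(\cdot),\mathbf{x}_0)$.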
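
To make the finite-set quantity $h_{\mathcal{D},\mathbf{a}}(\mathbf{x})$ from (63) concrete, the following minimal sketch evaluates it for a toy model that is not treated in this appendix: a scalar unit-variance Gaussian location family with unbiased estimation of $g(x) = x$, so that $\gamma(x) = x$ and (by assumption of this example) $R_{\mathcal{E},x}(x_1,x_2) = \exp\big((x_1-x)(x_2-x)\big)$. The names `kernel`, `gamma`, and `h` are hypothetical helpers.

```python
from math import exp, expm1

def kernel(x1, x2, x_ref):
    # Assumed kernel R_{E,x}(x1, x2) of the unit-variance Gaussian location family.
    return exp((x1 - x_ref) * (x2 - x_ref))

def gamma(x):
    # Mean function of an unbiased estimator of g(x) = x.
    return x

def h(points, coeffs, x_ref):
    # Finite-set quantity h_{D,a}(x): squared weighted sum of gamma over D,
    # divided by the quadratic form of the kernel, minus gamma(x)^2.
    num = sum(a * gamma(x) for a, x in zip(coeffs, points)) ** 2
    den = sum(a1 * a2 * kernel(x1, x2, x_ref)
              for a1, x1 in zip(coeffs, points)
              for a2, x2 in zip(coeffs, points))
    if den <= 0.0:
        raise ValueError("coefficient vector is not in A_D for this reference point")
    return num / den - gamma(x_ref) ** 2

D0 = [0.0, 0.5]    # fixed test points
a0 = [-1.0, 1.0]   # fixed coefficient vector
values = [h(D0, a0, 0.01 * k) for k in range(-20, 21)]
print(values[20], 0.25 / expm1(0.25))  # value at x = 0 versus the closed form
```

For this toy model the minimum achievable variance equals 1 at every parameter value, so each entry of `values` is a lower bound that cannot exceed 1, and the entries vary smoothly with the reference point, which is the continuity property exploited in (66).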






Appendix C 
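Proof of Theorem VI.2

The bound (53) in Theorem VI.2 is derived by using an isometry between the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ and the
RKHS $\mathcal{H}(R)$ that is defined by the kernel

$$ R(\cdot\,,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R} , \quad R(\mathbf{x}_1,\mathbf{x}_2) = \frac{\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0)}{\lambda(\mathbf{x}_0)} . \tag{68} $$

It is easily verified that $R(\cdot\,,\cdot)$ and, thus, $\mathcal{H}(R)$ are differentiable up to any order (cf. Definition III.4). Invoking
[8, Theorem 3.3.4], it can be verified that the two RKHSs $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ and $\mathcal{H}(R)$ are isometric and a specific
congruence $\mathsf{J} : \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0} \to \mathcal{H}(R)$ is given by

$$ \mathsf{J}[f(\cdot)] = \frac{\lambda(\mathbf{x})}{\lambda(\mathbf{x}_0)} \, f(\mathbf{x}) . \tag{69} $$

Similarly to the bound (22), we can then obtain a lower bound on $v(\hat{g}(\cdot);\mathbf{x}_0)$ via an orthogonal projection onto
a subspace of $\mathcal{H}(R)$. Indeed, with $c(\cdot) = \gamma(\cdot) - g(\cdot)$ denoting the bias function of the estimator $\hat{g}(\cdot)$, we have

$$
\begin{aligned}
v(\hat{g}(\cdot);\mathbf{x}_0) &\ge M(c(\cdot),\mathbf{x}_0) \\
&= \| \gamma(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) \\
&\stackrel{(a)}{=} \| \mathsf{J}[\gamma(\cdot)] \|^2_{\mathcal{H}(R)} - \gamma^2(\mathbf{x}_0) \\
&\ge \big\| \big( \mathsf{J}[\gamma(\cdot)] \big)_{\mathcal{U}} \big\|^2_{\mathcal{H}(R)} - \gamma^2(\mathbf{x}_0) ,
\end{aligned}
\tag{70}
$$

for an arbitrary subspace $\mathcal{U} \subseteq \mathcal{H}(R)$. Here, step (a) is due to the fact that $\mathsf{J}$ is a congruence, and $(\cdot)_{\mathcal{U}}$
denotes orthogonal projection onto $\mathcal{U}$. The bound (53) is obtained from (70) by choosing the subspace as
$\mathcal{U} = \mathrm{span}\big\{ r_{\mathbf{x}_0}^{(\mathbf{p}_l)}(\cdot) \big\}_{l \in [L]}$, with the functions $r_{\mathbf{x}_0}^{(\mathbf{p}_l)}(\cdot) \in \mathcal{H}(R)$ given by $r_{\mathbf{x}_0}^{(\mathbf{p}_l)}(\mathbf{x}) = \frac{\partial^{\mathbf{p}_l} R(\mathbf{x},\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_l}} \Big|_{\mathbf{x}_2 = \mathbf{x}_0}$.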
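Let us denote the image of $\gamma(\cdot)$ under the isometry $\mathsf{J}$ by $\tilde{\gamma}(\cdot) = \mathsf{J}[\gamma(\cdot)]$. According to (69),

$$ \tilde{\gamma}(\mathbf{x}) = \frac{\lambda(\mathbf{x})}{\lambda(\mathbf{x}_0)} \, \gamma(\mathbf{x}) . \tag{71} $$

Furthermore, the variance bound (70) reads

$$ v(\hat{g}(\cdot);\mathbf{x}_0) \ge \| \tilde{\gamma}_{\mathcal{U}}(\cdot) \|^2_{\mathcal{H}(R)} - \gamma^2(\mathbf{x}_0) . $$

Using (14), we obtain further

$$ v(\hat{g}(\cdot);\mathbf{x}_0) \ge \mathbf{n}^T(\mathbf{x}_0) \, \mathbf{S}^{\dagger}(\mathbf{x}_0) \, \mathbf{n}(\mathbf{x}_0) - \gamma^2(\mathbf{x}_0) , \tag{72} $$

where, according to (15), the entries of $\mathbf{n}(\mathbf{x}_0)$ and $\mathbf{S}(\mathbf{x}_0)$ are calculated as follows:

$$
\begin{aligned}
\big( \mathbf{n}(\mathbf{x}_0) \big)_l &= \big\langle \tilde{\gamma}(\cdot), r_{\mathbf{x}_0}^{(\mathbf{p}_l)}(\cdot) \big\rangle_{\mathcal{H}(R)} \\
&\stackrel{(21)}{=} \frac{\partial^{\mathbf{p}_l} \tilde{\gamma}(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l}} \bigg|_{\mathbf{x} = \mathbf{x}_0} \\
&\stackrel{(71)}{=} \frac{1}{\lambda(\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_l} [ \lambda(\mathbf{x}) \gamma(\mathbf{x}) ]}{\partial \mathbf{x}^{\mathbf{p}_l}} \bigg|_{\mathbf{x} = \mathbf{x}_0} \\
&\stackrel{(a)}{=} \frac{1}{\lambda(\mathbf{x}_0)} \sum_{\mathbf{p} \le \mathbf{p}_l} \binom{\mathbf{p}_l}{\mathbf{p}} \frac{\partial^{\mathbf{p}_l - \mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l - \mathbf{p}}} \, \frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \bigg|_{\mathbf{x} = \mathbf{x}_0} \\
&= \sum_{\mathbf{p} \le \mathbf{p}_l} \binom{\mathbf{p}_l}{\mathbf{p}} \, \mathsf{E}_{\mathbf{x}_0}\big\{ \boldsymbol{\phi}^{\mathbf{p}_l - \mathbf{p}}(\mathbf{y}) \big\} \, \frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \bigg|_{\mathbf{x} = \mathbf{x}_0}
\end{aligned}
\tag{73}
$$

(here, (a) is due to the generalized Leibniz rule for differentiation of a product of two functions [11, p. 104]),
and

$$
\begin{aligned}
\big( \mathbf{S}(\mathbf{x}_0) \big)_{l,l'} &= \big\langle r_{\mathbf{x}_0}^{(\mathbf{p}_l)}(\cdot), r_{\mathbf{x}_0}^{(\mathbf{p}_{l'})}(\cdot) \big\rangle_{\mathcal{H}(R)} \\
&\stackrel{(21)}{=} \frac{\partial^{\mathbf{p}_l} r_{\mathbf{x}_0}^{(\mathbf{p}_{l'})}(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l}} \bigg|_{\mathbf{x} = \mathbf{x}_0} \\
&= \frac{\partial^{\mathbf{p}_l}}{\partial \mathbf{x}^{\mathbf{p}_l}} \bigg\{ \frac{\partial^{\mathbf{p}_{l'}} R(\mathbf{x},\mathbf{x}_2)}{\partial \mathbf{x}_2^{\mathbf{p}_{l'}}} \bigg|_{\mathbf{x}_2 = \mathbf{x}_0} \bigg\} \bigg|_{\mathbf{x} = \mathbf{x}_0} \\
&\stackrel{(68)}{=} \frac{1}{\lambda(\mathbf{x}_0)} \, \frac{\partial^{\mathbf{p}_l + \mathbf{p}_{l'}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l + \mathbf{p}_{l'}}} \bigg|_{\mathbf{x} = \mathbf{x}_0} \\
&= \mathsf{E}_{\mathbf{x}_0}\big\{ \boldsymbol{\phi}^{\mathbf{p}_l + \mathbf{p}_{l'}}(\mathbf{y}) \big\} .
\end{aligned}
\tag{74}
$$

Note that the application of (21) was based on the differentiability of $\mathcal{H}(R)$ (cf. Theorem III.6). Comparing
(72), (73), and (74) with (53), (54), and (55), respectively, we conclude that the theorem is proved.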
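
The bound (72) with the entries (73) and (74) only involves moments of the sufficient statistic and derivatives of the mean function at $\mathbf{x}_0$. The following minimal sketch evaluates it under assumptions that go beyond the text: a scalar parameter, the unit-variance Gaussian location family written as an exponential family with $\phi(y) = y$, a nonsingular matrix $\mathbf{S}$ (so that a plain linear solve replaces the pseudo-inverse), and hypothetical helper names.

```python
from math import comb

def gaussian_moments(x0, k_max):
    # Raw moments E{y^k} of N(x0, 1) via m_k = x0 * m_{k-1} + (k - 1) * m_{k-2}.
    m = [1.0, x0]
    for k in range(2, k_max + 1):
        m.append(x0 * m[k - 1] + (k - 1) * m[k - 2])
    return m[: k_max + 1]

def solve(S, n):
    # Gaussian elimination with partial pivoting; assumes S is nonsingular.
    size = len(n)
    A = [row[:] + [rhs] for row, rhs in zip(S, n)]
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, size):
            factor = A[r][col] / A[col][col]
            for c in range(col, size + 1):
                A[r][c] -= factor * A[col][c]
    z = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(A[r][c] * z[c] for c in range(r + 1, size))
        z[r] = (A[r][size] - tail) / A[r][r]
    return z

def variance_bound(orders, gamma_derivs, x0):
    # orders: derivative orders p_l; gamma_derivs[p] = p-th derivative of gamma at x0.
    m = gaussian_moments(x0, 2 * max(orders))
    # Entries of n as in (73): Leibniz sum of moments times derivatives of gamma.
    n = [sum(comb(p_l, p) * m[p_l - p] * gamma_derivs[p] for p in range(p_l + 1))
         for p_l in orders]
    # Entries of S as in (74): moments of order p_l + p_l'.
    S = [[m[p + q] for q in orders] for p in orders]
    z = solve(S, n)
    return sum(n_l * z_l for n_l, z_l in zip(n, z)) - gamma_derivs[0] ** 2

# Unbiased estimation of g(x) = x, i.e., gamma(x) = x, at x0 = 0.7.
print(variance_bound([0, 1], [0.7, 1.0], 0.7))
```

For this toy model the sketch returns a value of 1 (up to rounding), which coincides with the Cramér-Rao bound for a unit-variance Gaussian mean; adding higher derivative orders cannot lower the bound, because the subspace onto which one projects only grows.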



Appendix D 
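Proof of Lemma VI.4

For $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot)\big)$ and $\mathbf{x}_0 \in \mathcal{X}$, consider a function $\gamma(\cdot) : \mathcal{X} \to \mathbb{R}$ belonging to the
RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$. By Theorem III.2, the function $c(\cdot) = \gamma(\cdot) - g(\cdot)$ is a valid bias function for $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot)\big)$ at $\mathbf{x}_0$; furthermore, the LMV estimator at $\mathbf{x}_0$ exists and is given by $\hat{g}^{(\mathbf{x}_0)}(\cdot) = \mathsf{J}[\gamma(\cdot)]$.
Trivially, this estimator has the finite variance $v\big(\hat{g}^{(\mathbf{x}_0)}(\cdot);\mathbf{x}_0\big) = M(c(\cdot),\mathbf{x}_0)$ at $\mathbf{x}_0$ and its mean function equals
$\gamma(\cdot)$, i.e., $\mathsf{E}_{\mathbf{x}}\big\{ \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big\} = \gamma(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. Hence, the mean power $\mathsf{E}_{\mathbf{x}}\big\{ \big( \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big)^2 \big\}$ is finite at $\mathbf{x}_0$, since

$$ \mathsf{E}_{\mathbf{x}_0}\big\{ \big( \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big)^2 \big\} = v\big( \hat{g}^{(\mathbf{x}_0)}(\mathbf{y});\mathbf{x}_0 \big) + \big( \mathsf{E}_{\mathbf{x}_0}\big\{ \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big\} \big)^2 = M(c(\cdot),\mathbf{x}_0) + \gamma^2(\mathbf{x}_0) < \infty . \tag{75} $$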

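Now, for any exponential family based estimation problem $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot)\big)$, it follows from
[40, Theorem 2.7] that the mean function $\mathsf{E}_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ of any estimator $\hat{g}(\cdot)$ is analytic$^6$ on the interior $\mathcal{T}^{\circ}$ of the
set $\mathcal{T} = \{ \mathbf{x} \in \mathcal{N} \,|\, \mathsf{E}_{\mathbf{x}}\{ |\hat{g}(\mathbf{y})| \} < \infty \}$. Furthermore, $\mathcal{T}$ can be shown to be a convex set [40, Corollary 2.6]. In
particular, the mean function $\gamma(\mathbf{x})$ of the LMV estimator $\hat{g}^{(\mathbf{x}_0)}(\cdot)$ is analytic on the interior $\mathcal{T}_0^{\circ}$ of the convex
set $\mathcal{T}_0 = \big\{ \mathbf{x} \in \mathcal{N} \,\big|\, \mathsf{E}_{\mathbf{x}}\big\{ \big| \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big| \big\} < \infty \big\}$. We will now verify that $\mathcal{X} \subseteq \mathcal{T}_0$. By a reasoning similar to the proof
of Theorem III.5 in Appendix A, again using the Hilbert space $\mathcal{H}_{\mathbf{x}_0} = \{ t(\mathbf{y}) \,|\, \mathsf{E}_{\mathbf{x}_0}\{t^2(\mathbf{y})\} < \infty \}$ and associated
inner product $\langle t_1(\mathbf{y}), t_2(\mathbf{y}) \rangle_{\mathrm{RV}} = \mathsf{E}_{\mathbf{x}_0}\{ t_1(\mathbf{y}) t_2(\mathbf{y}) \}$, we obtain for an arbitrary $\mathbf{x} \in \mathcal{X} \subseteq \mathcal{N}$

$$
\begin{aligned}
\mathsf{E}_{\mathbf{x}}\big\{ \big| \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big| \big\} &= \mathsf{E}_{\mathbf{x}_0}\bigg\{ \big| \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big| \, \frac{f^{(A)}(\mathbf{y};\mathbf{x})}{f^{(A)}(\mathbf{y};\mathbf{x}_0)} \bigg\} \\
&= \big\langle \big| \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big|, \rho(\mathbf{y},\mathbf{x}) \big\rangle_{\mathrm{RV}} \\
&\stackrel{(a)}{\le} \sqrt{ \big\langle \big| \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big|, \big| \hat{g}^{(\mathbf{x}_0)}(\mathbf{y}) \big| \big\rangle_{\mathrm{RV}} \, \big\langle \rho(\mathbf{y},\mathbf{x}), \rho(\mathbf{y},\mathbf{x}) \big\rangle_{\mathrm{RV}} } \\
&< \infty ,
\end{aligned}
$$

where $\rho(\mathbf{y},\mathbf{x}) = f^{(A)}(\mathbf{y};\mathbf{x}) / f^{(A)}(\mathbf{y};\mathbf{x}_0)$ and (a) follows from the Cauchy-Schwarz inequality in the Hilbert space $\mathcal{H}_{\mathbf{x}_0}$. Thus, we have verified that
$\mathcal{X} \subseteq \mathcal{T}_0$. Moreover, we have

$$ \mathcal{X} \subseteq \mathcal{T}_0^{\circ} . \tag{76} $$

This is implied$^7$ by $\mathcal{X} \subseteq \mathcal{T}_0$ together with the fact that (by assumption) $\mathcal{X}$ is an open set.
Let us now consider the restrictions

$$ \gamma_{\mathcal{R}_{\mathbf{x}_1}}(a) = \gamma\big( a \mathbf{x}_1 + (1-a) \mathbf{x}_0 \big) , \quad a \in (-\varepsilon, 1+\varepsilon) , \tag{77} $$

of $\gamma(\cdot)$ on line segments of the form $\mathcal{R}_{\mathbf{x}_1} = \{ a \mathbf{x}_1 + (1-a) \mathbf{x}_0 \,|\, a \in (-\varepsilon, 1+\varepsilon) \}$, where $\mathbf{x}_1 \in \mathcal{T}_0^{\circ}$ and $\varepsilon > 0$.
Here, $\varepsilon$ is chosen sufficiently small such that the vectors $\mathbf{x}_a = \mathbf{x}_0 - \varepsilon(\mathbf{x}_1 - \mathbf{x}_0)$ and $\mathbf{x}_b = \mathbf{x}_1 + \varepsilon(\mathbf{x}_1 - \mathbf{x}_0)$
belong to $\mathcal{T}_0^{\circ}$, i.e., $\mathbf{x}_a, \mathbf{x}_b \in \mathcal{T}_0^{\circ}$. Such an $\varepsilon$ can always be found, since (due to (76)) we have $\mathbf{x}_0 \in \mathcal{T}_0^{\circ}$. As
can be verified easily, any vector in $\mathcal{R}_{\mathbf{x}_1}$ is a convex combination of the vectors $\mathbf{x}_a$ and $\mathbf{x}_b$, which both belong
to the interior $\mathcal{T}_0^{\circ}$ of the convex set $\mathcal{T}_0$. Therefore we have $\mathcal{R}_{\mathbf{x}_1} \subseteq \mathcal{T}_0^{\circ}$ for any $\mathbf{x}_1 \in \mathcal{T}_0^{\circ}$, as the interior $\mathcal{T}_0^{\circ}$ of the
convex set $\mathcal{T}_0$ is itself a convex set [42, Theorem 6.2],$^8$ i.e., the interior $\mathcal{T}_0^{\circ}$ contains any convex combination
of its elements.

The function $\gamma_{\mathcal{R}_{\mathbf{x}_1}}(\cdot) : (-\varepsilon, 1+\varepsilon) \to \mathbb{R}$ in (77) is the composition of the mean function $\gamma(\cdot) : \mathcal{X} \to \mathbb{R}$, which
is analytic on $\mathcal{T}_0^{\circ} \subseteq \mathbb{R}^N$, with the vector-valued function $\mathbf{b}(\cdot) : (-\varepsilon, 1+\varepsilon) \to \mathcal{T}_0^{\circ}$ given by $\mathbf{b}(a) = a \mathbf{x}_1 + (1-a) \mathbf{x}_0$.
Since each component of the function $\mathbf{b}(\cdot)$, whose domain is the open interval $(-\varepsilon, 1+\varepsilon)$, is an analytic
function, the function $\gamma_{\mathcal{R}_{\mathbf{x}_1}}(\cdot)$ is itself analytic [12, Proposition 2.2.8].

$^6$ Following [12, Definition 2.2.1], we call a real-valued function $f(\cdot) : \mathcal{U} \to \mathbb{R}$ defined on some open domain $\mathcal{U} \subseteq \mathbb{R}^N$ analytic if
for every point $\mathbf{x}_c \in \mathcal{U}$ there exists a power series $\sum_{\mathbf{p} \in \mathbb{Z}_+^N} a_{\mathbf{p}} (\mathbf{x} - \mathbf{x}_c)^{\mathbf{p}}$ converging to $f(\mathbf{x})$ for every $\mathbf{x}$ in some neighborhood of $\mathbf{x}_c$.
Note that the coefficients $a_{\mathbf{p}}$ may vary with $\mathbf{x}_c$.

$^7$ Indeed, assume that the open set $\mathcal{X} \subseteq \mathcal{T}_0$ contains a vector $\mathbf{x}' \in \mathcal{X}$ that does not belong to the interior $\mathcal{T}_0^{\circ}$. It follows that no single
neighborhood of $\mathbf{x}'$ can be contained in $\mathcal{T}_0$ and, thus, no single neighborhood of $\mathbf{x}'$ can be contained in $\mathcal{X}$, since $\mathcal{X} \subseteq \mathcal{T}_0$. However,
because $\mathbf{x}'$ belongs to the open set $\mathcal{X} = \mathcal{X}^{\circ}$, there must be at least one neighborhood of $\mathbf{x}'$ that is contained in $\mathcal{X}$. Thus, we arrived at
a contradiction, which implies that every vector $\mathbf{x}' \in \mathcal{X}$ must belong to $\mathcal{T}_0^{\circ}$, or, equivalently, that $\mathcal{X} \subseteq \mathcal{T}_0^{\circ}$.

$^8$ Strictly speaking, [42, Theorem 6.2] states that the relative interior of a convex set is a convex set. However, since we assume that
$\mathcal{X}$ is open with non-empty interior and therefore, by (76), also $\mathcal{T}_0$ has a nonempty interior, the relative interior of $\mathcal{T}_0$ coincides with
the interior of $\mathcal{T}_0$.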
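Since the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$, $\frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \Big|_{\mathbf{x} = \mathbf{x}_0}$, are assumed to vanish for every $\mathbf{p} \in \mathbb{Z}_+^N$, the
(ordinary) derivatives of arbitrary order of the scalar function $\gamma_{\mathcal{R}_{\mathbf{x}_1}}(a)$ vanish at $a = 0$ (cf. [11, Theorem 9.15]).
According to [12, Corollary 1.2.5], since $\gamma_{\mathcal{R}_{\mathbf{x}_1}}(a)$ is an analytic function, this implies that $\gamma_{\mathcal{R}_{\mathbf{x}_1}}(a)$ vanishes
everywhere on its open domain $(-\varepsilon, 1+\varepsilon)$. This, in turn, implies that $\gamma(\cdot)$ vanishes on every line segment $\mathcal{R}_{\mathbf{x}_1}$
with arbitrary $\mathbf{x}_1 \in \mathcal{T}_0^{\circ}$ and, thus, $\gamma(\cdot)$ vanishes everywhere on $\mathcal{T}_0^{\circ}$. By (76), we finally conclude that $\gamma(\cdot)$ vanishes
everywhere on $\mathcal{X}$.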

Appendix E 
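Proof of Theorem VI.5

Because $c(\cdot)$ was assumed valid at $\mathbf{x}_0$, the corresponding mean function $\gamma(\cdot) = c(\cdot) + g(\cdot)$ is an element
of $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ (see Theorem III.2). Let $\gamma_1(\cdot) = \gamma(\cdot)|_{\mathcal{X}_1}$, and note that $\gamma_1(\cdot)$ is the mean function corresponding
to the restricted bias function $c_1(\cdot)$, i.e., $\gamma_1(\cdot) = c_1(\cdot) + g(\cdot)|_{\mathcal{X}_1}$. We have $\gamma_1(\cdot) \in \mathcal{H}_{\mathcal{E}_1^{(A)},\mathbf{x}_0}$ due to Theorem
III.2, because $\gamma_1(\mathbf{x})$ is the mean function (evaluated for $\mathbf{x} \in \mathcal{X}_1$) of an estimator $\hat{g}(\cdot)$ that has finite variance
at $\mathbf{x}_0$ and whose bias function on $\mathcal{X}$ equals $c(\mathbf{x})$. (The existence of such an estimator $\hat{g}(\cdot)$ is guaranteed since
$c(\cdot)$ was assumed valid at $\mathbf{x}_0$.) For the minimum achievable variance for the restricted estimation problem, we
obtain

$$ M_1(c_1(\cdot),\mathbf{x}_0) = \| \gamma_1(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}_1^{(A)},\mathbf{x}_0}} - \gamma_1^2(\mathbf{x}_0) = \min_{\substack{\gamma'(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0} \\ \gamma'(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)}} \| \gamma'(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} - \gamma_1^2(\mathbf{x}_0) , \tag{78} $$

where the second equality expresses the norm of the restricted function as the minimum norm over all of its extensions in $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ (a standard property of the RKHS of a restricted kernel [19]).

However, the only function $\gamma'(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ that satisfies $\gamma'(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)$ is the mean function $\gamma(\cdot)$. This
is a consequence of Lemma VI.4 and can be verified as follows. Consider a function $\gamma'(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ that
satisfies $\gamma'(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)$. By the definition of $\gamma_1(\cdot)$, we also have $\gamma(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)$. Therefore, the difference
$\gamma''(\cdot) \triangleq \gamma'(\cdot) - \gamma(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ satisfies $\gamma''(\cdot)|_{\mathcal{X}_1} = \gamma'(\cdot)|_{\mathcal{X}_1} - \gamma(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot) - \gamma_1(\cdot) = 0$, i.e., $\gamma''(\mathbf{x}) = 0$ for
all $\mathbf{x} \in \mathcal{X}_1$. Since $\mathbf{x}_0 \in \mathcal{X}_1^{\circ}$, this implies that $\frac{\partial^{\mathbf{p}} \gamma''(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \Big|_{\mathbf{x} = \mathbf{x}_0} = 0$ for all $\mathbf{p} \in \mathbb{Z}_+^N$. It then follows from Lemma
VI.4 that $\gamma''(\mathbf{x}) = 0$ for all $\mathbf{x} \in \mathcal{X}$ and, thus, $\gamma'(\mathbf{x}) = \gamma(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. This shows that $\gamma(\cdot)$ is the unique
function in $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ satisfying $\gamma(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)$. Therefore, we have

$$ \min_{\substack{\gamma'(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0} \\ \gamma'(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)}} \| \gamma'(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} = \| \gamma(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} , $$

and thus (78) becomes

$$ M_1(c_1(\cdot),\mathbf{x}_0) = \| \gamma(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} - \gamma_1^2(\mathbf{x}_0) = \| \gamma(\cdot) \|^2_{\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0) = M(c(\cdot),\mathbf{x}_0) . $$

Here, the second equality is due to the fact that $\gamma_1(\mathbf{x}_0) = \gamma(\mathbf{x}_0)$ (because $\mathbf{x}_0 \in \mathcal{X}_1^{\circ}$).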






References 

[1] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995. 

[2] E. Parzen, "Statistical inference on time series by Hilbert space methods, I." Appl. Math. Stat. Lab., Stanford University, Stanford, 
CA, Tech. Rep. 23, Jan. 1959. 

[3] D. D. Duttweiler and T. Kailath, "RKHS approach to detection and estimation problems - Part V: Parameter estimation," IEEE 

Trans. Inf. Theory, vol. 19, no. 1, pp. 29-37, Jan. 1973. 
[4] S. Schmutzhard, A. Jung, F. Hlawatsch, Z. Ben-Haim, and Y. C. Eldar, "A lower bound on the estimator variance for the sparse 

linear model," in Proc. 44th Asilomar Conf. Signals, Systems, Computers, Pacific Grove, CA, Nov. 2010, pp. 1976-1980. 
[5] S. Schmutzhard, A. Jung, and F. Hlawatsch, "Minimum variance estimation for the sparse signal in noise model," in Proc. IEEE 

ISIT 2011, St. Petersburg, Russia, Jul.-Aug. 2011, pp. 124-128. 
[6] A. Jung, S. Schmutzhard, F. Hlawatsch, and A. O. Hero III, "Performance bounds for sparse parametric covariance estimation in 

Gaussian models," in Proc. IEEE ICASSP 2011, Prague, Czech Republic, May 2011, pp. 4156-4159. 
[7] T. Kailath, "RKHS approach to detection and estimation problems - Part I: Deterministic signals in Gaussian noise," IEEE Trans. 

Inf. Theory, vol. 17, no. 5, pp. 530-549, Jan. 1971. 
[8] A. Jung, "An RKHS Approach to Estimation with Sparsity Constraints," Ph.D. dissertation, Vienna University of Technology, 

2011. 

[9] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins University Press, 1996. 
[10] B. R. Gelbaum and J. M. Olmsted, Counterexamples in Analysis. Mineola, NY: Dover Publications, 2003. 
[11] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York: McGraw-Hill, 1976. 
[12] S. G. Krantz and H. R. Parks, A Primer of Real Analytic Functions, 2nd ed. Boston, MA: Birkhauser, 2002. 
[13] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York: Springer, 1998. 

[14] P. R. Halmos and L. J. Savage, "Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics," Ann. Math. 

Statist., vol. 20, no. 2, pp. 225-241, 1949. 
[15] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation. Asymptotic Theory. New York: Springer, 1981. 
[16] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice Hall, 1993. 
[17] Y. C. Eldar, Rethinking Biased Estimation: Improving Maximum Likelihood and the Cramer-Rao Bound, ser. Foundations and 

Trends in Signal Processing. Hanover, MA: Now Publishers, 2007, vol. 1, no. 4. 
[18] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Pacific Grove, CA: Duxbury, 2002. 
[19] N. Aronszajn, "Theory of reproducing kernels," Trans. Am. Math. Soc., vol. 68, no. 3, pp. 337-404, May 1950. 
[20] W. Rudin, Real and Complex Analysis, 3rd ed. New York: McGraw-Hill, 1987. 

[21] E. W. Barankin, "Locally best unbiased estimates," Ann. Math. Statist., vol. 20, no. 4, pp. 477-501, 1949. 
[22] C. Stein, "Unbiased estimates with minimum variance," Ann. Math. Statist., vol. 21, no. 3, pp. 406-415, 1950. 
[23] P. R. Halmos, Measure Theory. New York: Springer, 1974. 

[24] H.-W. Sun and D.-X. Zhou, "Reproducing kernel Hilbert spaces associated with analytic translation-invariant Mercer kernels," J. 

Fourier Anal. Appl., vol. 14, no. 1, pp. 89-101, 2008. 
[25] D.-X. Zhou, "Derivative reproducing properties for kernel methods in learning theory," J. Fourier Anal. Appl., pp. 456-463, 2008. 

[26] D.-X. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Trans. Inf. Theory, vol. 49, pp. 1743-1752, 2003. 

[27] R. McAulay and E. Hofstetter, "Barankin bounds on parameter estimation," IEEE Trans. Inf. Theory, vol. 17, no. 6, pp. 669-676, 

Nov. 1971. 

[28] H. Cramer, "A contribution to the theory of statistical estimation," Skand. Akt. Tidskr., vol. 29, pp. 85-94, 1946. 
[29] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," Bull. Calcutta Math. Soc., vol. 37, 
pp. 81-91, 1945. 






[30] P. Stoica and B. C. Ng, "On the Cramer-Rao bound under parametric constraints," IEEE Signal Processing Letters, vol. 5, no. 7, 
pp. 177-179, Jul. 1998. 

[31] Z. Ben-Haim and Y. Eldar, "On the constrained Cramer-Rao bound with a singular Fisher information matrix," IEEE Signal 

Processing Letters, vol. 16, no. 6, pp. 453-456, June 2009. 
[32] T. J. Moore, "A theory of Cramer-Rao bounds for constrained parametric models," Ph.D. dissertation, University of Maryland, 

2010. 

[33] J. S. Abel, "A bound on mean-square-estimate error," IEEE Trans. Inf. Theory, vol. 39, no. 5, pp. 1675-1680, Sep. 1993. 

[34] A. Bhattacharyya, "On some analogues of the amount of information and their use in statistical estimation," Sankhya: The Indian 

Journal of Statistics (1933-1960), vol. 8, no. 1, pp. 1-14, Nov. 1946. 
[35] J. D. Gorman and A. O. Hero, "Lower bounds for parametric estimation with constraints," IEEE Trans. Inf. Theory, vol. 36, no. 6, 

pp. 1285-1301, Nov. 1990. 

[36] D. G. Chapman and H. Robbins, "Minimum variance estimation without regularity assumptions," Ann. Math. Statist., vol. 22, 

no. 4, pp. 581-586, Dec. 1951. 
[37] J. M. Hammersley, "On estimating restricted parameters," J. Roy. Statist. Soc. B, vol. 12, no. 2, pp. 192-240, 1950. 
[38] Z. Ben-Haim and Y. C. Eldar, "The Cramer-Rao bound for estimating a sparse parameter vector," IEEE Trans. Signal Processing, 

vol. 58, pp. 3384-3389, June 2010. 
[39] S. Kullback, Information Theory and Statistics. Mineola, NY: Dover Publications, 1968. 

[40] L. D. Brown, Fundamentals of Statistical Exponential Families, ser. Lecture Notes - Monograph Series. Hayward, CA: Institute 
of Mathematical Statistics, 1986. 

[41] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference, ser. Foundations and 

Trends in Machine Learning. Hanover, MA: Now Publishers, 2008, vol. 1, no. 1-2. 
[42] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ. Press, 1970. 



